Introduction 


In earlier units an important distinction has been drawn between sample 
quantities on the one hand — values calculated from data, such as the 
sample mean and the sample standard deviation — and corresponding 
population quantities on the other. The latter arise when a statistical 
model is assumed to be an adequate representation of the underlying 
variation in the population. Usually the model family is specified 
(binomial, Poisson, normal, ...), but the indexing parameters (the 
binomial probability $p$, the Poisson parameter $\lambda$, the normal mean $\mu$ and
normal variance $\sigma^2$, and so on) might be unknown; indeed, they usually
will be unknown. Often one of the main reasons for collecting data is to 
estimate, from a sample, the value(s) of the model parameter(s). 


Many different sample quantities can be used to estimate population 
parameters. For example, if there are x successes in a sample of n 
independent Bernoulli trials, each with probability p of success, then the 
proportion of successes in the sample, x/n, can be used to estimate p, the 
unknown parameter of the binomial distribution. Similarly, for normal 
distributions, the sample mean $\bar{x}$ can be used as an estimate of the
population mean $\mu$, and the sample variance $s^2$ as an estimate of the
population variance $\sigma^2$.


Sometimes, however, it is not totally clear which sample quantity should 
be used to estimate a population parameter. For instance, should the 
sample mean or the sample variance be used to estimate the Poisson 
parameter $\lambda$, as $\lambda$ is both the population mean and the population
variance of this distribution? Similarly, for a normal population, $\mu$ is the
population median as well as the population mean, so should the sample
mean or the sample median be used as an estimate of $\mu$?


In this unit, the focus is on point estimation: the task of providing a single
value to estimate a population parameter. To determine a point estimate,
or just estimate for short, an estimating formula, or estimator, is applied
to the data available. For example, the formula for a sample mean (an
estimator) might be applied to a sample to obtain the numerical value of
the sample mean (an estimate), which is a quantity that might be used to
estimate the population mean (a population parameter).

(Point estimation contrasts with interval estimation, or the providing of an
interval of values to estimate a parameter; interval estimation is the
subject of Unit 8.)


However, different samples can produce different estimates for the same 
parameter using the same estimator. For example, when using the sample 
mean as an estimator of a population mean, you already know that the 
observed value of the sample mean varies from sample to sample. This 
means that the sample mean will not, in general, equal the population 
mean. (Even if one of the various sample means happens to equal the 
population mean, the others almost certainly won’t. What is more, we 
wouldn’t know which of the sample means is equal to the population 
mean.) It is, however, desirable to obtain estimators that generally have 
values which are as close to the true value of the parameter of interest as 
possible. In Section 1, a number of examples are examined in which data 
have been collected on a random variable and a population parameter is to 
be estimated from the data. Attractive properties of an estimator are also
considered and illustrated with examples. 


In Section 2, a very important approach to parameter estimation, with 
broad application, is introduced. This approach is the method of maximum 
likelihood, or just maximum likelihood for short. For any given probability 
distribution, maximum likelihood is a general method for obtaining an 
estimator of the parameter of the distribution. Differentiation is the usual 
technique used in determining a maximum likelihood estimator, so in 
Subsection 3.1 we review the results on differentiation that we need here. 
The results of Subsection 3.1 are applied in Subsection 3.2 to obtain 
maximum likelihood estimates of parameters from samples of data. More 
generally, maximum likelihood estimators are derived in Subsection 4.1, 
and some properties of the method of maximum likelihood are given in 
Subsection 4.2. 


1 Principles of point estimation 


When some representative statistical model has been proposed for the 
variation observed in a random variable, point estimation is the process 
of using the data available to estimate the unknown value of the parameter 
(or parameters) of the model. The single number obtained from the data is 
a (point) estimate of the parameter. In Subsection 1.1, examples of 
(point) estimators — formulas which deliver (point) estimates when 
applied to the data — are given; and in Subsection 1.2, the question of 
‘What makes a good estimator?’ is addressed. In Subsection 1.3, you will 
use your computer to explore and compare some estimators. 


1.1 Point estimators 


We will start by considering an example of an estimator. 


Example 1 Counts of the leech Helobdella 


In a research study, 103 water samples were collected from a lake. To avoid 
confusion with our different use of the word ‘sample’ everywhere else, let 
us call these water samples ‘volumes’. The number of specimens of the 
leech Helobdella contained in each volume was counted. More than half of 
the volumes collected (58 of them) were free of this contamination, but all 
the other volumes contained at least one leech, and three contained five or 
more leeches. Table 1 gives the frequencies of the different counts. 


Table 1 Counts of the leech Helobdella in 103 water volumes 


Count       0   1   2   3   4   5   6   7   8   ≥9
Frequency  58  25  13   2   2   1   1   0   1    0


(Source: Jeffers, J.N.R. (1978) An Introduction to Systems Analysis with 
Ecological Applications, London, Edward Arnold) 




A plausible model for the observed variation in the counts is a Poisson
distribution. Since the parameter $\lambda$ of a Poisson distribution is the mean
of the distribution, a natural estimate of $\lambda$ is the sample mean $\bar{x}$. In this
case, the sample mean is

\[
\bar{x} = \frac{0 \times 58 + 1 \times 25 + 2 \times 13 + \cdots + 8 \times 1}{58 + 25 + 13 + \cdots + 1} = \frac{84}{103} \approx 0.816.
\]

So 0.816 is a point estimate of the unknown Poisson mean $\lambda$.





The dataset was presented in frequency form in Table 1 and the sample
mean, $\bar{x}$, was calculated in the usual way using those frequencies. Those
frequencies were, in turn, obtained from the raw data $x_1, x_2, \ldots, x_{103}$,
where the $x$s denote the number of Helobdella leeches in each of the 103
water volumes. In fact,

\[
x_1 = 2, \quad x_2 = 0, \quad x_3 = 1, \quad x_4 = 4, \quad x_5 = 0, \quad \ldots, \quad x_{103} = 0.
\]

In terms of the raw data, the sample mean is

\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_{103}}{103} = \frac{2 + 0 + \cdots + 0}{103} = \frac{84}{103} \approx 0.816,
\]

as before. The observed values that comprise the raw data can be thought
of as a particular collection of 103 independent observations
$X_1, X_2, \ldots, X_{103}$ on the random variable $X \sim \text{Poisson}(\lambda)$. Let

\[
\bar{X} = \frac{X_1 + X_2 + \cdots + X_{103}}{103}.
\]

Then $\bar{X}$ is a random variable whose observed value (0.816) is our estimate
of $\lambda$; $\bar{X}$ is an estimator of $\lambda$.

[Figure: The freshwater leech Helobdella robusta]


As has been mentioned before, it is often useful, as here, to distinguish
between random variables, by denoting them by upper-case letters such
as $X$, and their observed, sample, values, by denoting them by the
corresponding lower-case letters, in this case $x$. (In traditional printing
using movable type letters, the letters were stored in a ‘type case’, with
small letters in the ‘lower case’ and capital letters in the ‘upper case’.)

In Example 1, a procedure or estimating formula was used which may be
expressed as follows. Collect a total of $n$ water volumes and count the
numbers of leeches $X_1, X_2, \ldots, X_n$ in the volumes; find the total number of
leeches $X_1 + X_2 + \cdots + X_n$, and divide this number by $n$ to obtain the
average number of leeches in a volume of water. In other words, the
formula used to estimate the parameter $\lambda$ of the population model
Poisson($\lambda$) is the random variable

\[
\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}.
\]

It is called an estimator of $\lambda$. With different datasets from the same
situation, different values of the estimator would be obtained; these are
estimates of $\lambda$. For example, for the particular dataset in Example 1, the
estimate takes the value $84/103 \approx 0.816$.
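The calculation in Example 1 can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean_from_frequencies` and the dictionary layout are assumptions made here, not part of the module materials. The frequencies are those of Table 1; the final category of the table has frequency 0 and so contributes nothing.

```python
from fractions import Fraction

# Frequencies from Table 1: leech count -> number of water volumes.
# The last category in the table has frequency 0, so it is left out.
frequencies = {0: 58, 1: 25, 2: 13, 3: 2, 4: 2, 5: 1, 6: 1, 7: 0, 8: 1}


def sample_mean_from_frequencies(freq):
    """Apply the sample-mean estimator to data held in frequency form."""
    n = sum(freq.values())                                # number of volumes
    total = sum(count * f for count, f in freq.items())   # total number of leeches
    return Fraction(total, n)


lambda_hat = sample_mean_from_frequencies(frequencies)
print(lambda_hat, round(float(lambda_hat), 3))
```

The function plays the role of the estimator (a rule that can be applied to any dataset), while the value it returns for this particular dataset is the estimate.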


Example 1 illustrates the following essential features of a point estimation
problem.

• Some data: if we have no data, then we cannot make an estimate.

• A probability model for the way the data were generated. In
  Example 1, $X \sim \text{Poisson}(\lambda)$.

• The model involves a parameter whose value is unknown: this is the
  value we wish to estimate. In Example 1, the parameter is $\lambda$.

• An estimating formula or estimator of the parameter: this formula is
  obtained from the model (and includes symbols for the data rather
  than their numerical values). In Example 1, the estimator is
  $\bar{X} = (X_1 + X_2 + \cdots + X_{103})/103$.

• The value of the estimator given by entering the data into the
  estimating formula, that is, the estimate for the parameter. In
  Example 1, the estimate is $\bar{x} = (x_1 + x_2 + \cdots + x_{103})/103 \approx 0.816$.


We will now introduce an important piece of notation.


Hat notation


It is common practice to use a ‘hat’ symbol to indicate an estimate of
a parameter. So estimates of $\mu$ and $p$ (say) are denoted by $\hat{\mu}$ and $\hat{p}$
(pronounced ‘mew-hat’ and ‘p-hat’), respectively.

(Strictly speaking, a ‘hat’ symbol is called a ‘caret’ symbol.)


The hat notation is also used for an estimator. The estimator of the mean $\mu$ in
Example 1 is $\hat{\mu} = \bar{X} = (X_1 + X_2 + \cdots + X_n)/n$. It would be unwieldy to
develop a separate notation to distinguish estimates and estimators, so this
is not done.


Examples 2 and 3 further illustrate the features of a point estimation
problem.


Example 2 Alveolar-bronchiolar adenomas in mice


In a research experiment into the incidence of alveolar-bronchiolar
(respiratory) adenomas in mice, several historical datasets on groups of
mice were examined. (Source: Tamura, R.N. and Young, S.S. (1987) ‘A
stabilised moment estimator for the beta-binomial distribution’,
Biometrics, vol. 43, no. 4, pp. 813-24.) An adenoma is a benign tumour
originating in a gland. One of the groups contained
54 mice. After examination, six of the 54 mice were found to have
adenomas. These are the data from the first group. Assuming
independence between mice, the experiment consists of observing an
outcome $x$ on a binomial random variable $X$, which represents the number
of mice in the sample who have adenomas. The probability model is
$X \sim B(n, p)$ where $n$ is the sample size and $p$ is the unknown parameter
that we wish to estimate; $p$ is the probability that a mouse has an
adenoma. The obvious estimator (estimating formula) for $p$ is the
proportion of mice in the sample who have adenomas,

\[
\hat{p} = \frac{X}{n}.
\]

For this first group, the number observed was $x = 6$ and the sample size
was $n = 54$, so the estimate of $p$ is $\hat{p} = 6/54 = 1/9$, or about 0.111.




A different group might have involved a different number, n, of subjects 
but used the same experimental design. Making the same assumptions, our 
estimating procedure would again be: observe the value of the random 
variable X and divide this observed value by n. That is, X/n is again the 
estimator of p although, obviously, its value will generally differ in different 
experiments. 


Indeed, the experiment involved altogether 23 groups of mice.
Examination of the other 22 groups resulted in different estimates for the
proportion of affected mice in the wider population, from as low as
0/20 = 0 in one sample of 20, through a variety of different values such as
4/47 ≈ 0.085, to as high as 4/20 = 0.2.


Notice that the 23 different groups of mice (samples) are assumed to be
similar so that the observed proportions of alveolar-bronchiolar adenomas
in each group can all be viewed as different estimates of the overall
proportion of alveolar-bronchiolar adenomas in an appropriate underlying
population of mice.
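The way one estimator yields different estimates for different groups can be sketched as follows. The helper name `proportion_estimate` is hypothetical, and only the four (x, n) pairs actually quoted in Example 2 are used; the results for the remaining groups are not given in the text and are not guessed here.

```python
from fractions import Fraction


def proportion_estimate(x, n):
    """Estimate of the binomial parameter p: observed count divided by group size."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    return Fraction(x, n)


# (number of mice with adenomas, group size) for the groups quoted in Example 2.
groups = [(6, 54), (0, 20), (4, 47), (4, 20)]

for x, n in groups:
    p_hat = proportion_estimate(x, n)
    print(f"x = {x}, n = {n}: p_hat = {p_hat} (about {float(p_hat):.3f})")
```

The same formula is applied each time; only the data change, which is why the estimates differ from group to group.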





(In Example 2, the group size $n$ varies, but we know its value for any
group, so it is not a random variable. It is an observed (or perhaps chosen)
value, so it is denoted by a lower-case letter.)


Example 3 concerns the normal distribution. This differs from Examples 1
and 2 in that the distribution is continuous, rather than discrete, and
involves two unknown parameters rather than just one. Nevertheless, the
key features of the estimation problem are the same.


Example 3 Chest measurements of nineteenth-century Scottish soldiers


The chest circumference of each of 5732 nineteenth-century Scottish
soldiers was measured (in inches, to the nearest inch). These measurements
are the data. (The data are given in Table 2 of Unit 6 and a histogram of
the data in Figure 2 of Unit 6.) A histogram of the data suggests that they
might be observations from a normal distribution. Hence the probability
model has the form $X \sim N(\mu, \sigma^2)$, where $X$ denotes the chest circumference
in inches of a nineteenth-century Scottish soldier. (Later in Unit 6, the
normal distribution with $\mu = 40$ and $\sigma^2 = 4$ was used as a simplified
version of this model for these data.) The data are observations on the
random variables $X_1, X_2, \ldots, X_{5732}$, and the unknown parameters are $\mu$
and $\sigma^2$. The obvious estimators of these quantities are the sample mean

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

and the sample variance

\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2,
\]

where $n = 5732$. Computing the sample mean and variance for the
observed data values gives $\bar{x} \approx 39.8489$ inches and $s^2 \approx 4.2989$ inches$^2$, so
these are the estimates of $\mu$ and $\sigma^2$.


1.2 What makes a good estimator?


An estimator is a random variable, so it has a probability distribution.
This distribution is called the sampling distribution of the estimator.
(We investigated the sampling distribution of the sample mean and sample
total in Section 6 of Unit 6.) Looking at the mean and variance of its
sampling distribution can give a good idea of how well the estimator in
question can be expected to perform.


Activity 1 Mean and variance of a Poisson sample mean


Example 1 concerned a research study about the number of leeches of the
genus Helobdella that were found in each of 103 water volumes from a lake.
This study resulted in an estimate of the Poisson parameter $\lambda$ of
$\bar{x} = 0.816$, which is an observation on the random variable $\bar{X}$. Now
suppose that a similar study was carried out under similar circumstances,
but that only 48 water volumes were collected. Denote the number of
leeches in these volumes by $y_1, y_2, \ldots, y_{48}$. (It is much clearer to use
different symbols for values from a different study.) Assuming the same
Poisson model, if we again use the sample mean to estimate the unknown
parameter $\lambda$, our estimate from this second study would be

\[
\bar{y} = \frac{y_1 + y_2 + \cdots + y_{48}}{48}.
\]

This is an observation on the random variable

\[
\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_{48}}{48},
\]

our estimator of $\lambda$ in this case. Notice that both $\bar{X}$ and $\bar{Y}$ are estimators
of the same parameter $\lambda$.


What is the variance of a Poisson distribution whose mean is $\lambda$? Write
down in terms of the unknown parameter $\lambda$ the mean and variance of the
random variable $\bar{X}$ and of the random variable $\bar{Y}$. Hint: recall from Unit 6
that for any random sample taken from a population with mean $\mu$ and
variance $\sigma^2$, the mean and variance of the sample mean $\bar{Z}$, say, are given
by $E(\bar{Z}) = \mu$ and $V(\bar{Z}) = \sigma^2/n$, respectively, where $n$ is the sample size.


So what, in particular, does the mean of an estimator tell us about the
performance of the estimator? Some estimators may tend to give estimates
that are too high on average, while others may tend to give estimates that
are too low on average. (In science and engineering, the behaviour of the
average of an estimator is termed its ‘accuracy’.) Clearly, an estimator
that avoids being ‘wrong’ on average is desirable. Such estimators are said
to be unbiased. More specifically, an estimator is said to be unbiased if its
expected value is equal to the parameter being estimated, that is, if an
estimator gives estimates that are on average equal to the parameter being
estimated. Unbiasedness is a theoretical property of estimators under an
assumed model.


Unbiased estimators


Suppose that $\hat{\theta}$ is an estimator of a parameter $\theta$. (The unknown
parameter in estimation problems is often denoted $\theta$, which is the Greek
lower-case letter theta, pronounced ‘theta’.) Then $\hat{\theta}$ is an
unbiased estimator of $\theta$ if

\[
E(\hat{\theta}) = \theta.
\]







Example 4 Unbiasedness of the sample mean 


Recall from Subsection 6.1 of Unit 6 that whatever the underlying
population distribution, if $\bar{X}$ is the sample mean and $\mu$ is the population
mean, then

\[
E(\bar{X}) = \mu.
\]

We can therefore formalise this aspect of the notion that the sample mean
is a good estimator of the population mean: the sample mean is an
unbiased estimator of the population mean.


So in the case of a Poisson($\lambda$) distribution, for instance, the sample mean
is, on these grounds, a reasonable choice of estimator of the Poisson
parameter, $\lambda$, since $\lambda$ is the population mean in this case.





The result found in Example 4 is important and worth highlighting: 


The sample mean is an unbiased estimator of the population mean. 


Activity 2 An unbiased estimator of the binomial parameter 


Suppose that the random variable $X$ follows a binomial distribution
$B(n, p)$. Show that $\hat{p} = X/n$ is an unbiased estimator of $p$.


Bias of estimators


Suppose again that $\hat{\theta}$ is an estimator of a parameter $\theta$. An estimator $\hat{\theta}$
is biased if it is not unbiased. In this case, the bias of $\hat{\theta}$ is

\[
E(\hat{\theta}) - \theta.
\]

An estimator is said to be positively biased if

\[
E(\hat{\theta}) > \theta;
\]

such an estimator gives estimates that are too high on average.


Similarly, an estimator is said to be negatively biased if

\[
E(\hat{\theta}) < \theta;
\]

such an estimator gives estimates that are too low on average.


[Figure: The one that got away ... bias in estimation of fish size?]





Example 5 A biased estimator of the geometric parameter


Suppose that a random variable $X$ follows the geometric distribution with
parameter $p$. From Unit 4, the mean of the geometric distribution is
$\mu = 1/p$. We know, therefore, from Example 4, that $X$ itself, being the
sample mean of this sample of size $n = 1$, is an unbiased estimator of the
population mean $\mu$.


However, the above concerns estimation of the quantity $1/p$, not of the
parameter $p$ itself. No problem, one might say: use the estimator $1/X$ to
estimate the parameter $p$. And yes, this is a reasonable thing to do. But is
$\hat{p} = 1/X$ an unbiased estimator of $p$? The answer turns out to be no. In
fact, $E(\hat{p})$ is a complicated function of $p$ that it is far beyond the scope of
this module to derive. However, this complicated function of $p$ is certainly
not equal to $p$ itself, so $\hat{p} = 1/X$ is a biased estimator of $p$.

(It is generally true that $E(1/X) \neq 1/E(X)$. In this case, if you really
want to know, $E(1/X) = -p \log p/(1-p)$, and $\hat{p} = 1/X$ is positively biased.)
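The bias of $1/X$ can be checked numerically without any derivation. The sketch below assumes the geometric distribution on the values 1, 2, 3, ... (the version whose mean is $1/p$), approximates $E(1/X)$ by a truncated sum, and compares it with the expression $-p \log p/(1-p)$ quoted for this example. The function names and the choice of 10 000 terms are illustrative assumptions.

```python
import math


def expected_reciprocal_geometric(p, terms=10_000):
    """Approximate E(1/X) for X geometric on {1, 2, ...} by a truncated sum."""
    return sum((1 / x) * p * (1 - p) ** (x - 1) for x in range(1, terms + 1))


def quoted_formula(p):
    """The expression quoted for E(1/X): -p log p / (1 - p)."""
    return -p * math.log(p) / (1 - p)


for p in (0.1, 0.5, 0.9):
    numeric = expected_reciprocal_geometric(p)
    bias = numeric - p   # bias of the estimator 1/X as an estimator of p
    print(f"p = {p}: E(1/X) is about {numeric:.4f}, "
          f"formula gives {quoted_formula(p):.4f}, bias about {bias:+.4f}")
```

For each value of $p$ tried, the numerical expectation exceeds $p$, which is what positive bias means.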





Unbiasedness is not the be all and end all of desired properties of a good
estimator. It is also preferable for an estimator to have low variance.
Then, estimates resulting from statistical experiments can be expected to
almost always be quite close to the parameter that is being estimated
when the bias is also low. (In science and engineering, this property is that
of high ‘precision’.) In Activity 1, for example, you found the mean and
variance of the estimators $\hat{\lambda}_1 = \bar{X}$ and $\hat{\lambda}_2 = \bar{Y}$ of the Poisson parameter $\lambda$
based on two different samples of data from the same situation. Both $\hat{\lambda}_1$
and $\hat{\lambda}_2$ are unbiased estimators of $\lambda$ as their means are $\lambda$, but $\hat{\lambda}_1$ has the
smaller variance because it is based on a larger sample. So, in that sense,
$\hat{\lambda}_1$ is a better estimator of $\lambda$ than is $\hat{\lambda}_2$.
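A small simulation can make this comparison concrete. The sketch below is an illustration under stated assumptions: the ‘true’ value 0.8 for the Poisson parameter is made up for the purpose, the seed is arbitrary, and the Poisson generator is a simple textbook method (Knuth's multiplication method) written out here because Python's standard `random` module has no Poisson sampler. The function names are hypothetical.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """One Poisson(lam) observation, generated by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_sample_means(lam, n, reps, rng):
    """Simulate `reps` values of the sample mean of n independent Poisson(lam) counts."""
    return [sum(poisson_draw(lam, rng) for _ in range(n)) / n for _ in range(reps)]


rng = random.Random(248)   # fixed seed so that a run can be repeated
lam = 0.8                  # assumed 'true' parameter, for illustration only
means_103 = simulate_sample_means(lam, 103, 2000, rng)   # samples of 103 volumes
means_48 = simulate_sample_means(lam, 48, 2000, rng)     # samples of 48 volumes

for label, means in (("n = 103", means_103), ("n = 48", means_48)):
    print(label,
          "average of estimates:", round(statistics.mean(means), 3),
          "variance of estimates:", round(statistics.variance(means), 5))
```

Both sets of simulated estimates should centre on the assumed parameter value, while the estimates based on 103 volumes should be visibly less spread out than those based on 48.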


It is not unreasonable to sometimes use a biased estimator (such as the
one in Example 5), provided that its bias is small and its variance is low.


Other desirable qualities of an estimator relate to its properties as the 
sample size, n, increases. The performance of an estimator often improves 
as the sample size increases: its variance usually decreases, and its bias, if 
any, may well decrease also. One might expect this kind of performance on 
the basis that the more data you have, the better you should be able to 
estimate the parameter(s) of interest. 


Activity 3 The mean of a normal distribution 


Suppose that a random sample of twelve observations is taken from the
normal distribution, $N(\mu, 25)$.


(a) Is the sample mean an unbiased estimator of $\mu$? What is its variance?


(b) Would the variance of the sample mean decrease if the sample size 
were increased? 


In any given estimation problem, there is not always one clear estimator to 
use: there may be several possible alternative estimators that could be 
employed. The question that naturally arises is: ‘Which estimator is likely 
to lead to “better” estimates?’ This is illustrated in Example 6 and 
Activity 4. 




Example 6 Observations with different variances


Suppose that independent observations $X_1$, $X_2$ and $X_3$ come from normal
distributions that have the same mean, $\mu$, but whose variances are 1, 4
and 9. That is,

\[
X_1 \sim N(\mu, 1), \quad X_2 \sim N(\mu, 4) \quad \text{and} \quad X_3 \sim N(\mu, 9).
\]

One possible estimator of $\mu$ is

\[
\hat{\mu}_1 = \tfrac{1}{3}(X_1 + X_2 + X_3).
\]

Then

\begin{align*}
E(\hat{\mu}_1) &= E\{\tfrac{1}{3}(X_1 + X_2 + X_3)\} \\
&= \tfrac{1}{3} E(X_1 + X_2 + X_3) \\
&= \tfrac{1}{3}\{E(X_1) + E(X_2) + E(X_3)\} \\
&= \tfrac{1}{3}(\mu + \mu + \mu) = \mu.
\end{align*}

Here, we have first used $E(aX + b) = a E(X) + b$ with $a = \tfrac{1}{3}$, $b = 0$ and
then $E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n)$ with $n = 3$;
both results are initially from Unit 4. We have shown that the expected
value of the estimator $\hat{\mu}_1$ is simply the unknown parameter $\mu$. So $\hat{\mu}_1$ is an
unbiased estimator of $\mu$.


However, let us consider another possible estimator, namely

\[
\hat{\mu}_2 = \tfrac{1}{49}(36X_1 + 9X_2 + 4X_3).
\]

This estimator gives greater weight (more importance) to $X_1$ than to the
other variables, which seems appropriate as $X_1$ has a smaller variance than
the other variables and, in consequence, is likely to be closer to $\mu$ than the
other variables. Similarly, $\hat{\mu}_2$ gives less weight to $X_3$ than to the other
variables, which also seems appropriate, as $X_3$ has the largest variance. We
have that

\begin{align*}
E(\hat{\mu}_2) &= E\{\tfrac{1}{49}(36X_1 + 9X_2 + 4X_3)\} \\
&= \tfrac{1}{49} E(36X_1 + 9X_2 + 4X_3).
\end{align*}

But the expectation term equals $E(36X_1) + E(9X_2) + E(4X_3)$ by applying
the rule for $E(Y_1 + Y_2 + Y_3)$, where $Y_1 = 36X_1$, $Y_2 = 9X_2$ and $Y_3 = 4X_3$. So

\begin{align*}
E(\hat{\mu}_2) &= \tfrac{1}{49}\{E(36X_1) + E(9X_2) + E(4X_3)\} \\
&= \tfrac{1}{49}\{36E(X_1) + 9E(X_2) + 4E(X_3)\} \\
&= \tfrac{1}{49}(36\mu + 9\mu + 4\mu) = \mu.
\end{align*}

(In general, $E(a_1X_1 + a_2X_2 + a_3X_3) = a_1E(X_1) + a_2E(X_2) + a_3E(X_3)$.)

Hence $\hat{\mu}_2$ is also an unbiased estimator of $\mu$.


Thus both $\hat{\mu}_1$ and $\hat{\mu}_2$ have the desirable property of being unbiased
estimators of $\mu$. However, the values that they take are unlikely to be
identical. For example, if the data values are $x_1 = 5.2$, $x_2 = 4.7$ and
$x_3 = 4.8$, then

\[
\hat{\mu}_1 = \tfrac{1}{3}(5.2 + 4.7 + 4.8) = 4.9
\]

while

\[
\hat{\mu}_2 = \tfrac{1}{49}(36 \times 5.2 + 9 \times 4.7 + 4 \times 4.8) \approx 5.08.
\]

Should we prefer $\hat{\mu}_1$ or $\hat{\mu}_2$ as an estimator of $\mu$?


As noted earlier, it is preferable if an estimator has a small variance as well
as being unbiased. Hence we should examine the variances of $\hat{\mu}_1$ and $\hat{\mu}_2$.


Now,

\begin{align*}
V(\hat{\mu}_1) &= V\{\tfrac{1}{3}(X_1 + X_2 + X_3)\} \\
&= \tfrac{1}{9} V(X_1 + X_2 + X_3) \\
&= \tfrac{1}{9}\{V(X_1) + V(X_2) + V(X_3)\} \\
&= \tfrac{1}{9}(1 + 4 + 9) = \tfrac{14}{9} \approx 1.56.
\end{align*}

This time, we have first used $V(aX + b) = a^2 V(X)$ with $a = \tfrac{1}{3}$ and then
$V(X_1 + X_2 + \cdots + X_n) = V(X_1) + V(X_2) + \cdots + V(X_n)$ with $n = 3$ since
$X_1$, $X_2$ and $X_3$ are independent; again, results are initially from Unit 4.


For $\hat{\mu}_2$, we have

\begin{align*}
V(\hat{\mu}_2) &= V\{\tfrac{1}{49}(36X_1 + 9X_2 + 4X_3)\} \\
&= \left(\tfrac{1}{49}\right)^2 V(36X_1 + 9X_2 + 4X_3) \\
&= \left(\tfrac{1}{49}\right)^2 \{V(36X_1) + V(9X_2) + V(4X_3)\} \\
&= \left(\tfrac{1}{49}\right)^2 \{36^2\, V(X_1) + 9^2\, V(X_2) + 4^2\, V(X_3)\} \\
&= \tfrac{1}{2401}(1296 \times 1 + 81 \times 4 + 16 \times 9) = \tfrac{1764}{2401} \approx 0.73.
\end{align*}

(The third line follows from the second because
$V(Y_1 + Y_2 + Y_3) = V(Y_1) + V(Y_2) + V(Y_3)$, where
$Y_1 = 36X_1$, $Y_2 = 9X_2$, $Y_3 = 4X_3$, and the $Y$s are independent.)

Thus the variance of $\hat{\mu}_2$ is much smaller than the variance of $\hat{\mu}_1$, so $\hat{\mu}_2$ is
the better estimator.
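The arithmetic in Example 6 can be reproduced exactly with Python's `fractions` module. This is a sketch rather than a prescribed method: the helper names `weighted_estimate` and `estimator_variance` are made up for the illustration, and the variance helper relies on the independence of the observations, as the example does.

```python
from fractions import Fraction

variances = [1, 4, 9]   # V(X1), V(X2), V(X3), as given in Example 6


def weighted_estimate(weights, data):
    """Value of the linear estimator sum(w_i * x_i) for observed data."""
    return sum(w * x for w, x in zip(weights, data))


def estimator_variance(weights, variances):
    """Variance of sum(w_i * X_i) for independent X_i: sum of w_i^2 * V(X_i)."""
    return sum(w * w * v for w, v in zip(weights, variances))


w1 = [Fraction(1, 3)] * 3                                     # equal weights
w2 = [Fraction(36, 49), Fraction(9, 49), Fraction(4, 49)]     # weights 36 : 9 : 4
data = [Fraction("5.2"), Fraction("4.7"), Fraction("4.8")]    # x1, x2, x3

for name, w in (("first estimator", w1), ("second estimator", w2)):
    print(name,
          "| weights sum to", sum(w),
          "| estimate", round(float(weighted_estimate(w, data)), 2),
          "| variance", estimator_variance(w, variances))
```

In both cases the weights sum to 1, which is what makes each estimator unbiased for the common mean; the variances then decide between them.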





In forming $\hat{\mu}_2$ in Example 6, the coefficients of $X_1$, $X_2$ and $X_3$ were made
proportional to 36, 9 and 4 because

\[
36\, V(X_1) = 9\, V(X_2) = 4\, V(X_3),
\]

all being equal to 36. It can be shown, but will not be so here, that this
choice means that $\hat{\mu}_2$ has a smaller variance than any other unbiased
estimator of $\mu$ that is linear in $X_1$, $X_2$ and $X_3$. You can, however, check
this claim against another possible estimator that is linear in $X_1$, $X_2$ and
$X_3$ in the following activity.


Activity 4 Another estimator for observations with different variances

For $X_1$, $X_2$ and $X_3$ as defined in Example 6, consider the estimator

\[
\hat{\mu}_3 = \tfrac{1}{11}(6X_1 + 3X_2 + 2X_3).
\]

(These coefficients satisfy $6\, S(X_1) = 3\, S(X_2) = 2\, S(X_3) = 6$, where $S$
denotes standard deviation.)


(a) Show that $\hat{\mu}_3$ is an unbiased estimator of $\mu$.
(b) Calculate $V(\hat{\mu}_3)$, and hence verify that $V(\hat{\mu}_3)$ is greater than $V(\hat{\mu}_2)$.
(c) Which of $\hat{\mu}_1$ and $\hat{\mu}_3$ is the better estimator of $\mu$?




The examples in this subsection have shown that it can sometimes be
straightforward to determine whether an estimator is unbiased and to
calculate its variance. These qualities can be used to choose between
alternative estimators.


To finish the subsection, we give an example in which obtaining an 
unbiased estimator involves somewhat trickier mathematics. You will not 
be expected to reproduce all the algebraic details, and can ignore those in 
the accompanying screencast if you wish. However, the results of the 
example are important, as they relate to the task of obtaining an unbiased 
estimator of a population variance. 





Example 7 An unbiased estimator of the population variance


By definition, if $\mu$ and $\sigma^2$ are the population mean and the population
variance, then

\[
\sigma^2 = E[(X - \mu)^2].
\]

Suppose we want to estimate $\sigma^2$ from a random sample of $n$ observations,
$X_1, X_2, \ldots, X_n$. We estimate $\mu = E(X)$ by the sample mean
$\bar{X} = \sum X_i/n$ so, by analogy, an obvious estimator of $\sigma^2$ is

\[
W = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2.
\]

(The non-standard notation $W$ will help to keep things clearer later.)

However, the usual estimator of a population variance is the sample
variance,

\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.
\]

To understand why this latter estimator is generally preferred, we will first
determine the expected value of $\sum_{i=1}^{n} (X_i - \bar{X})^2$, which occurs in both
estimators. From this, we will be able to conclude that $S^2$ is an unbiased
estimator of $\sigma^2$ while $W$ is a biased estimator, which is the reason that $S^2$
is the preferable estimator in most situations.

[Figure: How sharp were prehistoric spear heads? A case study in point estimation?]


Screencast 7.1 goes through the mathematical manipulations involved in
showing that

\[
E\left\{ \sum_{i=1}^{n} (X_i - \bar{X})^2 \right\} = (n-1)\sigma^2. \tag{1}
\]

As already indicated, this screencast is optional.

Screencast 7.1 Verifying Equation (1) (optional)


It follows from Equation (1) that

\begin{align*}
E(S^2) &= E\left\{ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \right\}
        = \frac{1}{n-1}\, E\left\{ \sum_{i=1}^{n} (X_i - \bar{X})^2 \right\} \\
       &= \frac{1}{n-1} \times (n-1)\sigma^2 = \sigma^2.
\end{align*}

So $S^2$ is an unbiased estimator of $\sigma^2$, as stated above.
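Equation (1) can be checked exactly in a small case by listing every possible sample. The sketch below is an illustration under assumptions chosen here, not part of the module: the population is a fair six-sided die, the sample size is 3, and the helper names are made up. Because all 216 samples are equally likely, averaging over them gives exact expected values.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # a fair die, used purely as an illustration
n = 3                             # sample size


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))


mu = mean(population)
sigma2 = mean([(x - mu) ** 2 for x in population])   # population variance


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own sample mean."""
    xbar = mean(sample)
    return sum((x - xbar) ** 2 for x in sample)


samples = list(product(population, repeat=n))        # all equally likely samples
expected_ss = mean([sum_sq_dev(s) for s in samples])  # left-hand side of Equation (1)
expected_S2 = expected_ss / (n - 1)                   # E(S^2), divisor n - 1
expected_W = expected_ss / n                          # E(W), divisor n

print("population variance:", sigma2)
print("E(sum of squared deviations):", expected_ss, " (n - 1) * variance:", (n - 1) * sigma2)
print("E(S^2):", expected_S2, " E(W):", expected_W)
```

In this case the expected sum of squared deviations agrees with $(n-1)\sigma^2$, so dividing by $n - 1$ recovers $\sigma^2$ on average, while dividing by $n$ does not.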





Activity 5 A biased estimator of the population variance 


(a) Use Equation (1) to show that

\[
W = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2
\]

is a biased estimator of $\sigma^2$.


(b) Calculate the bias of W, and interpret what the bias tells us. 


1.3 Exploring and comparing estimators by computer


The work in this subsection consists of a chapter of Computer Book B. 
You will use computer animations to compare the properties of different 
estimators of particular parameters. 


Refer to Chapter 3 of Computer Book B for the work in this 
subsection. 


Exercise on Section 1 





Exercise 1 Weight gains of pigs 


From prior experience, it is known that the weight gain of pigs on a
high-protein diet has a standard deviation of 9.5 kg; on a low-protein diet
the standard deviation is 8.2 kg. A new supplement is added to the diets.
With this supplement, let a pig’s average weight gain be $\mu_1$ on the
high-protein diet and $\mu_2$ on the low-protein diet. To estimate $\mu_1 - \mu_2$, the
difference in average weight gain on the two diets, eleven pigs are put on
the high-protein diet and ten pigs are put on the low-protein diet; both
diets include the supplement. The supplement does not change the
standard deviations of weight gain.


A natural estimator of $\mu_1 - \mu_2$ is $\bar{X}_1 - \bar{X}_2$, where $\bar{X}_1$ is the mean weight
gain of the eleven pigs on the high-protein diet and $\bar{X}_2$ is the mean weight
gain of the ten pigs on the low-protein diet. The estimators $\bar{X}_1$ and $\bar{X}_2$ are
independent because they are based on different pigs.


(a) Show that $\bar{X}_1 - \bar{X}_2$ is an unbiased estimator of $\mu_1 - \mu_2$.
(b) Find the variance of the estimator $\bar{X}_1 - \bar{X}_2$.


(c) If one further pig were available for the experiment, what would be the
variance of the estimator if the pig were put on the high-protein diet?
What would be the variance of the estimator if the pig were put on the
low-protein diet? Which of the two diets should the pig be put on in
order to get the best estimate of $\mu_1 - \mu_2$? (In either case, the
difference between sample means remains an unbiased estimator of the
difference between population means.)





2 The method of maximum likelihood 


In the previous section, estimates and estimators were introduced and we 
discussed what makes a good estimator. In this section, we will introduce 
one of the most widely used methods for finding estimates and estimators. 


In our day-to-day lives we regularly guess the most likely explanation for 
things that happen. If somebody walks past your window holding an open 
umbrella, the most likely reason is that it is raining. If you press a light 
switch and nothing happens, you might conclude that ‘the light bulb has 
probably gone’, because that seems the most likely reason for the failure. 
If many people are waiting at a bus stop, you might conclude that a bus 
should come soon because it seems likely that one has not come for some 
time. 





The approach of asking ‘What could best explain this?’ or ‘What is most
likely?’ can be used to estimate the unknown parameter(s) of a probability
model. Given a set of observations, we can ask: ‘What value of the
parameter is most likely to have given rise to these observations?’ To this
end, we will define a likelihood function which encapsulates the probability
of the observations arising from the model for each of the possible values of
its parameter. (See Subsection 2.1 for a detailed explanation.) And then we
will choose our estimate of the parameter to be the value which maximises
this likelihood. This value is referred to as the maximum likelihood
estimate of the parameter. The whole methodology of forming estimators in
this way is called maximum likelihood estimation. It is arguably the most
important method of constructing estimators: it is highly versatile (it can be
used in an enormous variety of situations) and the estimators that it yields
have good properties.


The basic idea underlying maximum likelihood estimation is introduced in 
detail in Subsection 2.1. Partly for notational convenience, discrete and 
continuous distributions are considered separately; but the underlying 
principle is essentially the same for both forms of distribution. Maximum 
likelihood estimation for discrete distributions is discussed in
Subsection 2.1 and for continuous distributions in Subsection 2.2. To
introduce ideas, in this section, we use graphs to obtain approximate 
values of maximum likelihood estimates, and summarise how far we have 
got to in Subsection 2.3; calculus will then be used in Sections 3 and 4 to 
obtain estimates and estimators exactly. 
2.1 Discrete probability models


In this subsection, it is assumed that observations are collected on a
discrete random variable $X$. The random variable $X$ therefore has a
probability mass function $p(x) = P(X = x)$. Suppose that the variation in
observations on $X$ is to be modelled by a probability distribution indexed
by a single unknown parameter $\theta$. To emphasise that there is a parameter
$\theta$ involved, the probability mass function may be written $p(x; \theta)$. So

$$P(X = x) = p(x; \theta)$$

for all values $x$ in the range of $X$.


Suppose initially that our sample consists of just a single observation $x$
from the discrete probability model with p.m.f. $p(x; \theta)$. This means that,
before collecting the sample, the probability of observing the sample value
$x$ is

$$p(x; \theta). \qquad (2)$$

But this probability depends on the value of $\theta$; if $\theta = \theta_1$, say, then $p(x; \theta_1)$
takes one value, while if $\theta = \theta_2$, say, then $p(x; \theta_2)$ takes a different value.
After collecting the sample, we know the value of $x$ but we still don’t know
the value of $\theta$. So instead of our usual habit of thinking of $p(x; \theta)$ as a
function of $x$ (for fixed but unknown $\theta$), we can now think of $p(x; \theta)$ as a
function of $\theta$ (for fixed and known $x$). It is this function of $\theta$ that we will
denote by

$$L(\theta) = p(x; \theta) \qquad (3)$$

and call the likelihood of $\theta$ based on a single observation from a
discrete model. Here, and in more general situations, we usually
abbreviate this terminology to just the likelihood function or often just
the likelihood.

Having obtained the likelihood of the data value (its probability of arising
given our model, for each value of its parameter $\theta$), we can choose our
estimate of $\theta$ to maximise this likelihood function, that is, as the value of $\theta$
which makes the data value we observed the most probable to have arisen
under the model. Said again, we choose the maximum likelihood
estimate, $\hat{\theta}$, of $\theta$ as the value of $\theta$ which maximises the likelihood function
$L(\theta)$.


Let’s see how this works out for an observation from the binomial 
distribution. 





Example 8 Rolling a biased die 


Suppose a die is rolled ten times and on seven of these rolls it lands
showing a 5. If $\theta$ is the probability that the die shows a 5 when rolled once,
then it seems likely that $\theta$ is much greater than $\frac{1}{6}$ and that the die is
biased.

What is the value of $\theta$ that is most likely to give seven 5s in ten rolls?
Assuming the rolls of the die are independent, the number of 5s has a
binomial distribution, $B(10, \theta)$. Hence the probability of observing exactly
seven 5s in ten rolls is

$$p(7; \theta) = \binom{10}{7} \theta^7 (1 - \theta)^3 = \frac{10!}{7!\,3!}\, \theta^7 (1 - \theta)^3 = 120\,\theta^7 (1 - \theta)^3.$$

Now, instead of thinking of $p(7; \theta)$ as the probability of the sample value
we observed (i.e. 7), given a fixed (if unknown) value for $\theta$, we turn things
round and consider $p(7; \theta)$ as the likelihood function for $\theta$ given the known
value, 7, of the observation:

$$L(\theta) = p(7; \theta) = 120\,\theta^7 (1 - \theta)^3.$$

The method of maximum likelihood involves finding the value of $\theta$ that
makes this likelihood as large as possible.


Now, in the binomial model, the parameter $\theta$ can take any value between 0
and 1. The likelihood, being a function of $\theta$, is therefore a function of a
continuous argument, $\theta$, even though the data value (and model) with
which we are dealing is discrete. Table 2 gives the value of $L(\theta)$ for some of
the possible values of $\theta$; Figure 1 plots $L(\theta)$ as a function of $\theta$ over its
entire range (0,1).

Table 2 The value of the likelihood for various values of $\theta$

$\theta$       0    0.2      0.4      0.6      0.7      0.8      1
$L(\theta)$    0    0.0008   0.0425   0.2150   0.2668   0.2013   0

Figure 1 A graph of $L(\theta)$

The table suggests that $L(\theta)$ increases from 0 at $\theta = 0$ to a peak at about
$\theta = 0.7$ and then decreases. This is confirmed by Figure 1, in which $L(\theta)$ is
plotted against $\theta$. The figure shows that the likelihood is maximised when
$\theta$ is approximately 0.7. So $\hat{\theta}$, the maximum likelihood estimate of $\theta$, is
approximately 0.7. (In fact, $\hat{\theta} = 0.7$ is the exact value of the maximum
likelihood estimate. You will learn how to find the exact value of a maximum
likelihood estimate in the next section.)
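The graphical search in Example 8 can also be mimicked numerically. The following Python sketch is an illustration only: the function name `binomial_likelihood` and the grid spacing of 0.001 are assumptions made here, not part of the unit. It evaluates the likelihood on a fine grid of candidate values of $\theta$ and picks the value giving the largest likelihood.

```python
from math import comb


def binomial_likelihood(theta, successes=7, trials=10):
    """Likelihood of theta for `successes` in `trials` under B(trials, theta)."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)


# Reproduce some of the entries of Table 2.
for theta in (0.2, 0.4, 0.6, 0.7, 0.8):
    print(theta, round(binomial_likelihood(theta), 4))

# Crude stand-in for reading the peak off Figure 1: try 0, 0.001, ..., 1.
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=binomial_likelihood)
print("approximate MLE:", theta_hat)
```

A grid search like this only locates the maximum to within the grid spacing, which is why it is described as approximate.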


Activity 6 Likelihood for a binomial parameter 


Example 2 concerned adenomas in mice. In one group of 54 mice, six had
adenomas. Let $\theta$ be the (unknown) proportion of mice in the whole
population that have adenomas.

(a) Write down $L(\theta)$, the likelihood for $\theta$ (based on the above data).

(b) Evaluate $L(\theta)$ at $\theta = 0.11$ and $\theta = 0.12$, and hence complete the
following table.

Table 3

$\theta$       0.09     0.10     0.11     0.12     0.13
$L(\theta)$    0.1484   0.1643                     0.1558

(c) Use the values in the completed table to sketch a graph with $L(\theta)$
plotted against $\theta$ for values of $\theta$ between 0.09 and 0.13. (Even though
$L(\theta)$ is calculated for only five values of $\theta$ in part (b), remember that $\theta$
can actually take any value between 0 and 1; in this case, $L(\theta)$ turns
out to be very small for values of $\theta$ less than 0.09 or greater than 0.13.)

(d) Find an approximate value for the maximum likelihood estimate of $\theta$.


So far, so much fuss about nothing, you might think! We have gone 
through all this likelihood rigmarole in Example 8 and Activity 6 just to 
get ‘obvious’ estimates of the probability $\theta$ in the binomial case. Well, yes,
the estimates might be obvious so far, but the approach remains useful in 
far more complicated situations when the appropriate value of an estimate 
is nothing like so obvious. And it is an encouraging property of the general 
maximum likelihood approach that it does reduce to simple, ‘obvious’ 
estimates in simple cases. 


So, to continue to develop the maximum likelihood methodology, let us
now consider the more usual situation where, for the purposes of
estimating the value of $\theta$, a random sample $X_1, X_2, \ldots, X_n$ of size $n$, where
$n$ is greater than 1, is collected. Under the model with p.m.f. $p(x; \theta)$, the
probability that $X_1$ takes the value $x_1$, say, is $p(x_1; \theta)$; the probability that
$X_2$ equals $x_2$ is $p(x_2; \theta)$; and so on. Although the term ‘random sample’
has been used a number of times in the module already, we will now make
explicit what has largely been implicit before: the observations in a
random sample are assumed to be independent of one another. It follows
that the probability that $X_1 = x_1$ and $X_2 = x_2$ is the product of the
probability that $X_1 = x_1$ and the probability that $X_2 = x_2$:

$$P(X_1 = x_1, X_2 = x_2) = p(x_1; \theta) \times p(x_2; \theta).$$

And, more generally, it follows that the probability that $X_1 = x_1$ and
$X_2 = x_2$ and $\cdots$ and $X_n = x_n$ (that is, the probability of obtaining
$x_1, x_2, \ldots, x_n$ as the collection of sample values) is the product of all the
individual probabilities:

$$p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta). \qquad (4)$$

Now, this expression gives the probability that our actual sample arose,
given the true, but unknown, value of $\theta$. As such, it is the direct extension
to $n > 1$ of the probability that our sample arose when $n = 1$ given in
Expression (2).

So, using Expression (4) in place of Expression (2), the argument proceeds
just as it did before. First, as we do not know $\theta$, we cannot be sure what
the true value of this probability is. However, we can work out the value of
this probability for various guessed values of $\theta$. In doing this, we are
treating the probability as a function of $\theta$: we switch things round from
considering Expression (4) as a function of $x_1, x_2, \ldots, x_n$ for fixed, if
unknown, $\theta$ to a function of $\theta$ for fixed values of $x_1, x_2, \ldots, x_n$ provided by
the sample. For each particular value of $\theta$, this function tells us how likely
we are to obtain our particular sample. So it seems reasonable to estimate
$\theta$ as the value that gives maximum probability to the particular sample
that actually arose; that is, we should choose $\theta$ to maximise the quantity in
Expression (4).


The probability in Expression (4) is called the likelihood of $\theta$ for the sample
$x_1, x_2, \ldots, x_n$ or, usually, simply the likelihood, and is denoted by $L(\theta)$:

$$L(\theta) = p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta). \qquad (5)$$

This likelihood, really the likelihood function, is to be considered as a
function of the unknown parameter $\theta$, and is the extension of Equation (3)
to the situation where $n > 1$. Note that the possible values of $\theta$ usually lie
in some continuous interval regardless of whether the data are discrete or
continuous.

The method of maximum likelihood involves answering the question:
‘What value of $\theta$ maximises the chance of observing the random sample
that was, in fact, obtained?’ So we define the maximum likelihood estimate
of $\theta$, denoted by $\hat{\theta}$, as the value of $\theta$ that maximises the likelihood $L(\theta)$
given by Equation (5). A ubiquitous abbreviation in statistics is the one
for the maximum likelihood estimate: MLE.


The method of maximum likelihood in the discrete case is summarised in 
the following box. 


The method of maximum likelihood for discrete data

If $X$ is a discrete random variable with probability mass function
$p(x; \theta)$, where $\theta$ is an unknown parameter, then the likelihood for the
random sample $x_1, x_2, \ldots, x_n$ is denoted by $L(\theta)$ and is given by

$$L(\theta) = p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta).$$

The method of maximum likelihood involves finding the value $\hat{\theta}$ of $\theta$
that maximises the likelihood $L(\theta)$. This value is the maximum
likelihood estimate (MLE) of $\theta$.


In the next example and the following activities, the method of maximum 
likelihood is used with datasets consisting of more than one observation. 





Example 9 Estimating the parameter of a geometric distribution 


Consider a very small artificial dataset of three observations, $x_1 = 3$, $x_2 = 4$
and $x_3 = 8$. Suppose that these are independent observations from a
geometric distribution with unknown parameter. For consistency with the
current development, the parameter indexing the geometric distribution
will temporarily be referred to as $\theta$. (Conventionally, it is denoted by $p$.)

The probability mass function for the geometric distribution with
parameter $\theta$ is

$$p(x; \theta) = (1 - \theta)^{x-1}\theta, \quad x = 1, 2, 3, \ldots.$$

It follows that the likelihood for this particular random sample of size 3 is
given by

$$\begin{aligned}
L(\theta) &= p(x_1; \theta) \times p(x_2; \theta) \times p(x_3; \theta) \\
&= (1 - \theta)^{x_1 - 1}\theta \times (1 - \theta)^{x_2 - 1}\theta \times (1 - \theta)^{x_3 - 1}\theta \\
&= (1 - \theta)^{2}\theta \times (1 - \theta)^{3}\theta \times (1 - \theta)^{7}\theta \\
&= (1 - \theta)^{2+3+7}\theta^{3} \\
&= (1 - \theta)^{12}\theta^{3}.
\end{aligned}$$

(In forming likelihoods, use is often made of the result that
$q^a \times q^b = q^{a+b}$, where $q$ is any quantity.)

The likelihood $L(\theta)$ is a function of the unknown parameter $\theta$ which lies
somewhere between 0 and 1: for different values of $\theta$, the function will take
different values. We need to find the value of $\theta$ at which this function is
maximised. Table 4 gives values of the likelihood $L(\theta)$ for various values
of $\theta$; $L(\theta)$ is graphed as a function of all values of $\theta$ in (0,1) in Figure 2.

Table 4 The values of the likelihood for various values of $\theta$

$\theta$       0    0.2          0.4          0.6          0.8          1
$L(\theta)$    0    0.00054976   0.00013931   0.00000362   0.00000000   0

Figure 2 A graph of $L(\theta)$

These calculations suggest that the likelihood is maximised somewhere
between $\theta = 0$ and $\theta = 0.4$, possibly at $\theta = 0.2$ itself. As you can see from
the graph of the likelihood in Figure 2, the likelihood is maximised when
the value of $\theta$ is approximately 0.2. (In fact, 0.2 is the exact value of $\hat{\theta}$.)
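To make the product structure of the likelihood in Example 9 concrete, here is a short Python sketch. It is illustrative only; the helper names `geometric_pmf` and `likelihood` and the grid of candidate values are assumptions introduced here. It multiplies the individual probabilities together, confirms that the result agrees with the simplified form $(1 - \theta)^{12}\theta^3$, and then searches a grid for the maximising value.

```python
from math import prod


def geometric_pmf(x, theta):
    """p(x; theta) = (1 - theta)**(x - 1) * theta for x = 1, 2, 3, ..."""
    return (1 - theta) ** (x - 1) * theta


def likelihood(theta, sample):
    """Product of the individual probabilities, as in Equation (5)."""
    return prod(geometric_pmf(x, theta) for x in sample)


sample = [3, 4, 8]

# The product form agrees with the simplified expression (1 - theta)^12 * theta^3.
for theta in (0.2, 0.4, 0.6):
    direct = likelihood(theta, sample)
    simplified = (1 - theta) ** 12 * theta**3
    print(theta, f"{direct:.8f}", f"{simplified:.8f}")

# Approximate the MLE by trying 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: likelihood(t, sample))
print("approximate MLE:", theta_hat)
```

Swapping in a different p.m.f. for `geometric_pmf` would give the likelihood for another discrete model in the same way.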





Activity 7 Different data from a geometric model 


Consider another very small artificial dataset that can be assumed to come
from a geometric distribution, with a different value of its parameter $\theta$
(which we wish to estimate). This dataset consists of the four independent
observations $x_1 = 1$, $x_2 = 2$, $x_3 = 1$ and $x_4 = 3$.

(a) Show that the likelihood for this particular random sample of size 4 is
given by

$$L(\theta) = (1 - \theta)^3 \theta^4.$$

(b) A graph of the likelihood obtained in part (a) is shown in Figure 3, for
all $\theta$ in (0,1). Use this figure to find the approximate value of the
MLE of $\theta$.

Figure 3 A graph of $L(\theta)$


The next activity is different from the previous examples and activities in 
this subsection, in that the probability model being applied (and which is 
indexed by a single unknown parameter) is not one of the standard 
families. However, the principle is the same: a value of the parameter is 
sought that maximises the likelihood for the sample that was actually 
observed. 


Activity 8 The leaves of Indian creeper plants 


The leaves of the Indian creeper plant Pharbitis nil can be variegated or
unvariegated and, at the same time, faded or unfaded. The resulting four
leaf types are denoted 0 for unvariegated and unfaded, v for variegated and
unfaded, f for unvariegated and faded, and vf for variegated and faded.

In an experiment, plants were crossed. Of 290 offspring plants observed,
the four types of leaf occurred with frequencies which are given in Table 5.

A theory allowing for a phenomenon called ‘genetic linkage’ assumes that
the observations in this experiment might have arisen from a probability
distribution indexed by an unknown parameter $\theta$. According to this
theory, the different types of leaf have the probabilities given in Table 5.
Since all these probabilities must be non-negative, something we do know
about $\theta$ is that it must lie between $-\frac{1}{16}$ and $\frac{3}{16}$. (These limits are because
we must have $\frac{1}{16} + \theta > 0$ and $\frac{3}{16} - \theta > 0$; in this model, $\theta$ is not itself a
probability, so might be negative.)

Table 5 Pharbitis nil model probabilities and observed frequencies

Type of leaf                    Probability                Frequency
Unvariegated and unfaded (0)    $\frac{9}{16} + \theta$    187
Variegated and unfaded (v)      $\frac{3}{16} - \theta$    37
Unvariegated and faded (f)      $\frac{3}{16} - \theta$    35
Variegated and faded (vf)       $\frac{1}{16} + \theta$    31

(Source: Bailey, N.T.J. (1961) Introduction to the Mathematical Theory of
Genetic Linkage, Oxford, Clarendon Press, p. 41)

(a) Write down the likelihood $L(\theta)$ associated with this experiment.

(b) The likelihood function turns out to take extremely small values, so, to
make it visible, Figure 4 shows $L(\theta) \times 10^{134}$. Even when rescaled like
this, the likelihood function takes essentially zero values outside the
range $0 < \theta < 0.1$, so $L(\theta)$ is plotted only on this range. What is the
approximate MLE of $\theta$?

Figure 4 A graph of $L(\theta) \times 10^{134}$ for $0 < \theta < 0.1$


2.2 Continuous probability models


So far, attention has been restricted to discrete probability models. In this
subsection, the maximum likelihood approach is developed for continuous
random variables.

What is the likelihood of the unknown parameter $\theta$ for a random sample
$x_1, x_2, \ldots, x_n$ from a continuous distribution? In the discrete case, we were
able to say that the likelihood is the product of the probabilities $p(x_i; \theta)$.
However, for continuous random variables, the probability of obtaining any
particular value, such as $x_1$, is effectively zero, so the probability of
obtaining any particular sample of values, such as $x_1, x_2, \ldots, x_n$, is
effectively zero, also. The key to moving from the discrete case to the
continuous case is to replace the probability mass function of the discrete
case with the probability density function in the continuous case. So in the
continuous case, the likelihood is obtained by replacing the p.m.f. $p(x; \theta)$
throughout Equation (5) by the p.d.f. $f(x; \theta)$. (Here, $f(x)$ has been rewritten
as $f(x; \theta)$ to emphasise the dependence of $f$ on $\theta$.) Thus, in the continuous
case, the likelihood may be written as

$$L(\theta) = f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_n; \theta). \qquad (6)$$

In this way, instead of choosing $\theta$ to maximise the probability of the
observed data, as in the discrete case, we are actually choosing $\theta$ to
maximise the density of the observed data in the continuous case.

The notation $L(\theta)$ again expresses the fact that the likelihood is thought of
as a function of $\theta$ (not of the fixed values $x_1, x_2, \ldots, x_n$). The method of
maximum likelihood involves finding the value $\hat{\theta}$ of $\theta$ that maximises this
likelihood.





Example 10 Estimating the exponential parameter 


Flood protection was built around a river in a small town to reduce the
risk of flooding. However, it was not very successful. After its
construction, the town first flooded five years later, the second flood was
eight years after that, and the third flood came after a further seven years.
Suppose that the time between floods is an observation from an
exponential distribution with parameter $\theta > 0$, and that the MLE of $\theta$ is
required. For an exponential distribution with parameter $\theta$, the probability
density function is

$$f(x; \theta) = \theta e^{-\theta x}, \quad x > 0.$$

Thus, for the random sample $x_1 = 5$, $x_2 = 8$, $x_3 = 7$ from this distribution,
the likelihood of $\theta$ is

$$\begin{aligned}
L(\theta) &= f(x_1; \theta) \times f(x_2; \theta) \times f(x_3; \theta) \\
&= \theta e^{-\theta x_1} \times \theta e^{-\theta x_2} \times \theta e^{-\theta x_3} \\
&= \theta e^{-5\theta} \times \theta e^{-8\theta} \times \theta e^{-7\theta} \\
&= \theta^3 e^{-\theta(5+8+7)} = \theta^3 e^{-20\theta}.
\end{aligned}$$

(This illustrative example and Activity 9 to follow assume that the risk of
flooding is constant, which makes no allowance for climate change.)

As with discrete probability models, the MLE of $\theta$ can be determined
approximately from a graph of $L(\theta)$ against $\theta$. Again because all values of
the likelihood are very small, Figure 5 shows the likelihood multiplied by a
large factor, this time $10^4 = 10\,000$, and because even this rescaled
likelihood is still very small for $\theta > 0.5$, it is plotted only for $\theta < 0.5$ (recall
that $\theta$ can actually take any positive value). Figure 5 shows that the MLE
of $\theta$ is about 0.15. (In fact, 0.15 is the exact value of $\hat{\theta}$.)











Figure 5 A graph of $L(\theta) \times 10^4$ for $0 < \theta < 0.5$





It is quite typical that likelihoods for continuous data take very small 
values. To be able to work with simpler numbers, in Figure 5 (as in 
Figure 4 in the discrete case), the likelihood was rescaled by multiplication 
by a large factor. The amount of this rescaling is arbitrary, and 
unimportant in the sense that the maximiser of the likelihood function is 
in the same place regardless of what (positive) factor the likelihood is 
multiplied by. Also, even after rescaling, the likelihood often remains very 
small for many of the possible values of $\theta$, so in Figures 4 and 5, $L(\theta)$ was
plotted only over that interval of values of $\theta$ for which $L(\theta)$ is reasonably
large. We will continue to employ this type of rescaling and plotting over a
limited interval of values of $\theta$ where appropriate in further figures in this
unit. 
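The claim that rescaling leaves the maximiser unchanged can be checked numerically for the flood data of Example 10. The Python sketch below is an illustration under assumptions made here (the helper names, and a grid of candidate values from 0.001 to 0.500 in steps of 0.001); it is not a method used in the unit itself.

```python
from math import exp, prod


def exponential_pdf(x, theta):
    """f(x; theta) = theta * exp(-theta * x) for x > 0."""
    return theta * exp(-theta * x)


def likelihood(theta, sample):
    """Product of the individual densities, as in Equation (6)."""
    return prod(exponential_pdf(x, theta) for x in sample)


gaps = [5, 8, 7]  # years between floods in Example 10
grid = [i / 1000 for i in range(1, 501)]  # candidate values 0.001, ..., 0.500

# Maximise the likelihood itself, then the likelihood multiplied by 10^4.
theta_hat = max(grid, key=lambda t: likelihood(t, gaps))
theta_hat_rescaled = max(grid, key=lambda t: 1e4 * likelihood(t, gaps))

print("maximiser of L(theta):", theta_hat)
print("maximiser of 10^4 x L(theta):", theta_hat_rescaled)
print("rescaled peak height:", round(1e4 * likelihood(theta_hat, gaps), 3))
```

Both searches land on the same grid point, because multiplying every value of the likelihood by the same positive constant does not change which value is largest.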


Activity 9 Estimating another exponential parameter 


In another small town with a similar geography and potential for flooding,
flood protection built many years ago seems to be rather more effective. In
subsequent years, the town has flooded just twice: once 25 years after
construction, then again some 31 years after that. Suppose that here too
the times between floods are observations from an exponential distribution,
but with a different value for its parameter $\theta > 0$, and that the MLE of $\theta$ is
required.

(a) What is the likelihood for $\theta$ based on this history of flooding?

(b) Figure 6 shows (a rescaled version of) $L(\theta)$ for these data. What is the
approximate MLE of $\theta$?

Figure 6 A graph of a rescaled version of $L(\theta)$ for $0 < \theta < 0.25$

Activity 10 The Rayleigh distribution 


Let $X$ denote wind speed. This is a continuous, positive, random variable
that is sometimes modelled as following the Rayleigh distribution. The
probability density function of the Rayleigh distribution is given by

$$f(x; \theta) = \frac{x}{\theta^2}\, e^{-x^2/(2\theta^2)}, \quad x > 0,$$

with $\theta > 0$. (This distribution has various further applications, including a
natural role in MRI (magnetic resonance imaging) scanning.)

(a) Suppose that 22.2, 2.8, 4.0, 13.9, 11.7 and 8.3 are a random sample of
six observations of $X$ taken at the site of a possible wind farm on
different days, measured in km/h. Show that the likelihood $L(\theta)$ of $\theta$ is
given (approximately) by

$$L(\theta) = \frac{335\,621}{\theta^{12}}\, e^{-457.8/\theta^2}.$$

(b) Evaluate $L(\theta)$ at $\theta = 8.50$ and $\theta = 9.00$, and hence complete the
following table giving $L(\theta)$ for various values of $\theta$.

Table 6

$\theta$                   8.25    8.50    8.75    9.00    9.25
$L(\theta) \times 10^9$    4.048           4.216           4.059

(c) Use the values in the table to sketch a plot of $L(\theta) \times 10^9$ against $\theta$
for values of $\theta$ between 8.25 and 9.25. Hence find an approximate
value for $\hat{\theta}$.


2.3 The story so far


Let us start this subsection with some more notation. In the same way
that $\sum$ is used to denote a sum, for example,

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n,$$

so $\prod$ is used to denote a product, for example,

$$\prod_{i=1}^{n} x_i = x_1 \times x_2 \times \cdots \times x_n.$$

($\Pi$ is the Greek upper-case letter Pi.) This allows us to write the likelihood
in a slightly more compact way, and to remind you of the workings of
maximum likelihood estimation for both discrete and continuous data.


The method of maximum likelihood

If $X$ is a random variable with a distribution with unknown
parameter $\theta$, then the likelihood of $\theta$ for the random sample
$x_1, x_2, \ldots, x_n$, or likelihood for short, is denoted by $L(\theta)$ and is
given by

$$L(\theta) = \begin{cases} \prod_{i=1}^{n} p(x_i; \theta) & \text{if } X \text{ is discrete} \\ \prod_{i=1}^{n} f(x_i; \theta) & \text{if } X \text{ is continuous,} \end{cases}$$

where $p(x; \theta)$ is the probability mass function and $f(x; \theta)$ is the
probability density function.

The method of maximum likelihood involves finding the value $\hat{\theta}$ of $\theta$
that maximises the likelihood $L(\theta)$. This value is the maximum
likelihood estimate (MLE) of $\theta$.


Let us now make some further remarks about the likelihood. (The second
one is a reinforcement of something you have seen already.)

• ‘Likelihood’ has the above specific, technical, meaning to statisticians.
This meaning is not the same as its everyday use as a synonym for
‘chance’ or even ‘probability’.

• ‘The likelihood’ is shorthand for ‘the likelihood function’. The
likelihood is a function of $\theta$. This represents a turnaround given that
its constituent parts, probability mass or density functions, are usually
thought of as functions of ‘$x$’, indexed by $\theta$. In the likelihood, $\theta$
becomes the argument of the function while ‘the $x$s’, that is,
$x_1, x_2, \ldots, x_n$, are thought of as fixed quantities.

• The likelihood is positive: $L(\theta) > 0$. This is because all its constituent
parts, which are multiplied together, are positive. (In the discrete
case, the probabilities of observed events must be positive. But even
in the continuous case, the density $f$ at an observed value $x_i$ cannot be
zero.) If your likelihood is zero or negative for any value of $\theta$, you’ve
made a mistake in its construction!
If you are not confident with how to formulate a likelihood function or are
not clear on its interpretation, Screencast 7.2 might be helpful to you.


Screencast 7.2 Formulating and understanding a likelihood


In this section, we have made plots of $L(\theta)$, as a function of $\theta$. These plots
have then been used to determine (approximately) the point at which $L(\theta)$
is maximised: this point gives (approximately) the maximum likelihood
estimate $\hat{\theta}$. Exact maximisation methods are, however, really needed.
These could be numerical or mathematical. Numerical methods comprise
computational algorithms that home in, to very high levels of precision, on
the maximum of the likelihood function. Numerical methods come into
their own when models become much more complicated and parameters
much more numerous. We won’t investigate any of them in M248. When
likelihood functions are relatively simple and the number of parameters is
small (particularly when the number of parameters is 1!), their maxima
can usually be found mathematically, using calculus: the remainder of this
unit focuses on using calculus to find MLEs.


Exercise on Section 2


Exercise 2 Clay pigeon shooting


Clay pigeon shooting, also known as clay target shooting, consists of
individuals shooting guns at clay targets that are fired into the air by a
machine. A man tried this out for the first time, stopping after he had hit
four targets. He took three shots to first hit a target, one shot to hit the
next target, and two shots each to hit two further targets. Suppose that
the number of shots he takes to hit a clay target can be modelled by a
geometric distribution with parameter $\theta$. (The man is presumed not to
have enough practice to improve noticeably as he goes along!)

(a) Given these data, find an expression for $L(\theta)$, the likelihood of $\theta$.

(b) Evaluate $L(\theta)$ at $\theta = 0.4$ and $\theta = 0.5$, and hence complete the
following table.

Table 7

$\theta$       0.3      0.4      0.5      0.6      0.7
$L(\theta)$    0.0019                     0.0033   0.0019

(c) Use the values in the completed table to draw a rough graph with $L(\theta)$
plotted against $\theta$ for values of $\theta$ between 0.3 and 0.7.

(d) Use your graph to find an approximate value for the MLE of $\theta$.




3 Using calculus to find maximum likelihood estimates


In Section 2, graphs were used to determine maximum likelihood 
estimates. More commonly, calculus is used because it yields exact values. 
Also, a graph can be used to determine the estimate only for a particular 
set of data, whereas calculus can be used to derive the formula for a 
maximum likelihood estimator, the general formula that gives the 
maximum likelihood estimate when the observed data values are entered 
into it. In this section, however, we will continue to confine attention to 
maximum likelihood estimates themselves. 


The key to finding maxima is differentiation. In Subsection 3.1, we revise
results on differentiation that are required in this unit. (These results
should already be familiar to you.) In Subsection 3.2, we will use the
results to find some maximum likelihood estimates.


3.1 Differentiation of powers, polynomials and exponentials, and their combinations and products

In this unit, we will need to differentiate powers of $x$, polynomials in $x$
and exponential functions such as $e^{-x}$, and functions made up by combining
these quantities in particular ways. (You integrated powers and
polynomials in Subsection 3.1 of Unit 2.)


Differentiating powers, polynomials and exponentials 


Let us start by differentiating powers. 


The derivative of a constant times a power

If $f(x) = ax^k$, then the derivative of $f(x)$ is

$$\frac{d}{dx} f(x) = f'(x) = kax^{k-1}.$$

(Both notations for the derivative, $\frac{d}{dx} f(x)$ and $f'(x)$, will be used in
what follows.)

In words, we multiply by the power of $x$ and then reduce the power of
$x$ by 1. Notice that the multiplicative constant $a$ remains unchanged.

Two important special cases of this are the derivatives of a constant
and of $x$ itself:

• since $a = ax^0$, it follows that $f'(x) = 0$ when $f(x) = a$

• since $ax = ax^1$, it follows that $f'(x) = a$ when $f(x) = ax$.




Unit 7 Point estimation 


Example 11 Differentiating powers 

To illustrate applying this rule:

if $f(x) = 4$, then $f'(x) = 0$;

if $f(x) = 3x$, then $f'(x) = 3$;

if $f(x) = 4x^3$, then $f'(x) = 3 \times 4x^{3-1} = 12x^2$;

if $f(x) = \dfrac{5}{x^2} = 5x^{-2}$, then $f'(x) = -2 \times 5x^{-2-1} = -10x^{-3} = -\dfrac{10}{x^3}$;

and if $f(x) = 2\sqrt{x} = 2x^{1/2}$, then $f'(x) = \dfrac{1}{2} \times 2x^{(1/2)-1} = x^{-1/2} = \dfrac{1}{\sqrt{x}}$.

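The derivatives in Example 11 can be sanity-checked numerically. The Python sketch below is an illustration under assumptions made here: the helper names, the test point $x = 2$ and the step size are arbitrary choices, and a central difference is only an approximation to the derivative. It compares the rule $f'(x) = kax^{k-1}$ with that approximation for each $(a, k)$ pair in the example.

```python
def numerical_derivative(f, x, step=1e-6):
    """Central-difference approximation to f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)


def power(a, k):
    """Return the function f(x) = a * x**k."""
    return lambda x: a * x**k


def power_derivative(a, k):
    """Return f'(x) = k * a * x**(k - 1), following the rule in the box."""
    return lambda x: k * a * x ** (k - 1)


x0 = 2.0  # arbitrary positive test point
cases = [(4, 0), (3, 1), (4, 3), (5, -2), (2, 0.5)]  # (a, k) pairs from Example 11

for a, k in cases:
    rule = power_derivative(a, k)(x0)
    check = numerical_derivative(power(a, k), x0)
    print(f"a={a}, k={k}: rule {rule:.6f}, numerical {check:.6f}")
```

The two columns agree to several decimal places, which is all that a finite-difference check can be expected to show.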
Activity 11 Differentiating powers 
Find the derivatives of each of the following functions. 
5 3 
(a) 6z? (b) 4x1 (c) PT (d) —4a/x (e) 27x (f) E 
Now we can extend from powers to polynomials; the key is the method for
dealing with a sum of functions.


The derivative of a sum of functions 


Suppose that $f(x) = g(x) + h(x) + \cdots + q(x)$, where
$g(x), h(x), \ldots, q(x)$ are any functions of $x$. Then

$$f'(x) = g'(x) + h'(x) + \cdots + q'(x).$$

That is, the derivative of a sum is the sum of the derivatives.

More generally, if $a, b, \ldots, k$ are constants and
$f(x) = ag(x) + bh(x) + \cdots + kq(x)$, then

$$f'(x) = ag'(x) + bh'(x) + \cdots + kq'(x).$$





Example 12 Differentiating a polynomial 


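As a first example, if

$$f(x) = 4x^3 + \frac{5}{x^2} + 3x,$$

then

$$f'(x) = \frac{d}{dx}\left(4x^3\right) + \frac{d}{dx}\left(\frac{5}{x^2}\right) + \frac{d}{dx}(3x) = 12x^2 - \frac{10}{x^3} + 3.$$

Here, we have used the derivatives of the individual power terms already
found in Example 11.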
Activity 12 Differentiating polynomials


Find the derivatives of the following. 
(a) 4+ 3x + x? — 523 + 207 
3 2 


(b) 4-2-5 


The exponential function arises in, for example, the p.m.f. of the Poisson 
distribution and the p.d.f. of the exponential distribution, so we have to be 
able to differentiate it as well. Happily, this is no more difficult than 
differentiating powers of x. 


The derivative of a constant times an exponential function

If $f(x) = ae^{kx}$, then the derivative of $f(x)$ is

$$f'(x) = kae^{kx}.$$

In words, we just multiply by the coefficient of $x$ in the exponential
function.

An important special case of this is the derivative of the exponential
function itself. Since $e^x$ corresponds to $a = k = 1$, it follows that
$f'(x) = e^x$ when $f(x) = e^x$.





Example 13 Differentiating exponentials 

To illustrate applying this rule:

if $f(x) = 2e^{3x}$, then $f'(x) = 3 \times 2e^{3x} = 6e^{3x}$;

if $f(x) = e^{-x}$, then $f'(x) = (-1) \times e^{-x} = -e^{-x}$;

if $f(x) = 3e^{x/3}$, then $f'(x) = \dfrac{1}{3} \times 3e^{x/3} = e^{x/3}$.

The second of these examples, the derivative of $e^{-x}$, which results in
minus $e^{-x}$, is particularly useful in statistics.
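As with powers, the exponential rule can be checked against a finite-difference approximation. This Python sketch is illustrative only; the helper names, the test point $x = 1$ and the step size are assumptions made here. It covers the three functions of Example 13, written as $(a, k)$ pairs for $f(x) = ae^{kx}$.

```python
from math import exp


def numerical_derivative(f, x, step=1e-6):
    """Central-difference approximation to f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)


def exponential(a, k):
    """Return the function f(x) = a * exp(k * x)."""
    return lambda x: a * exp(k * x)


def exponential_derivative(a, k):
    """Return f'(x) = k * a * exp(k * x), following the rule in the box."""
    return lambda x: k * a * exp(k * x)


x0 = 1.0  # arbitrary test point
cases = [(2, 3), (1, -1), (3, 1 / 3)]  # (a, k) pairs from Example 13

for a, k in cases:
    rule = exponential_derivative(a, k)(x0)
    check = numerical_derivative(exponential(a, k), x0)
    print(f"a={a}, k={k:.4f}: rule {rule:.6f}, numerical {check:.6f}")
```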





Activity 13 Differentiating exponentials 


Find the derivatives of each of the following functions. 


(a) 6e®/? (b) 3e78" (c) 10e-0-1 


If you are unsure about the basic differentiation methods that you have
just worked through, Screencast 7.3 might be of assistance.


Screencast 7.3 Differentiating a polynomial plus an exponential


Differentiating functions of functions, and products


You will also need to use the rules for differentiating certain functions of
functions and, since the likelihood is a product of functions, for
differentiating products of functions. Let us start with functions of
functions.

Suppose that you need to differentiate the function

$$f(x) = (2x + 1)^8.$$

This is a power of a polynomial. Now, you could, in this case, expand the
power, thereby writing $f(x)$ as a long polynomial. (You couldn’t make such
an expansion if the power were, say, 8.5.) But it turns out to be
much easier to treat $f(x)$ as a function (the power 8) of another function
(the polynomial $2x + 1$). The result for differentiating a function of a
function is known as the chain rule.


The chain rule 


Suppose that the function $f(x)$ can be written in terms of other
functions $g$ and $h$ as

$$f(x) = h(g(x)). \qquad (8)$$

Then

$$f'(x) = g'(x)\, h'(g(x)).$$


We will use the chain rule particularly in the case of powers of polynomials. 





Example 14 Differentiating powers of polynomials 
To illustrate applying this rule, consider differentiating 
$f(x) = (2x + 1)^8$.
This is of the form of Equation (8) if we set
$h(y) = y^8$ and $y = g(x) = 2x + 1$.
Now,
$h'(y) = 8y^7$ and $g'(x) = 2$.
It follows that
$f'(x) = 2 \times 8y^7 = 2 \times 8(2x + 1)^7 = 16(2x + 1)^7$.
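As a check on Example 14, the chain-rule answer can be compared with the result of expanding the power and differentiating term by term. This is a minimal Python sketch, not part of the module; the function names are assumptions made for the illustration, and the comparison uses integer values of $x$ so that the arithmetic is exact.

```python
from math import comb

def chain_rule_derivative(x):
    """f'(x) for f(x) = (2x + 1)^8 from the chain rule: 16(2x + 1)^7."""
    return 16 * (2 * x + 1) ** 7

# Binomial expansion: (2x + 1)^8 = sum over k of C(8, k) * 2^k * x^k
coeffs = [comb(8, k) * 2 ** k for k in range(9)]

def expanded_derivative(x):
    """Differentiate the expanded polynomial term by term: sum of k * c_k * x^(k-1)."""
    return sum(k * c * x ** (k - 1) for k, c in enumerate(coeffs) if k > 0)

# With integer x both routes give exactly the same integer
for x in (-2, 0, 1, 3):
    print(x, chain_rule_derivative(x), expanded_derivative(x))
```

The expansion route needs nine coefficients, whereas the chain rule needs one line, which is the point made in the text about treating $f(x)$ as a function of a function.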


Activity 14 Differentiating powers of polynomials 


Find the derivatives of each of the following functions. 




2 4 
a — g)k T = 
(a) (1— x) @)12(1+2 +) 


And finally, what about a product of functions? 


The derivative of a product of two functions 


Suppose that f(x) = g(x) x h(x), where g(x) and h(x) are any 
functions of x. Then 


$f'(x) = g'(x)\, h(x) + g(x)\, h'(x)$. (9)





Example 15 Differentiating a product 
Suppose that 
$f(x) = 3x(2x + 1)^8$.
Then $f(x)$ can be written as $g(x)\,h(x)$ where
$g(x) = 3x$ and $h(x) = (2x + 1)^8$.
Now,
$g'(x) = 3$ and, from Example 14, $h'(x) = 16(2x + 1)^7$.
It follows from Equation (9) that
$f'(x) = 3 \times (2x + 1)^8 + 3x \times 16(2x + 1)^7$.


We would then normally simplify the expression for $f'(x)$ by extracting
common factors, so that


$f'(x) = 3(2x + 1)^7 (2x + 1 + 16x) = 3(2x + 1)^7 (18x + 1)$.
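The simplification in Example 15 can be verified by evaluating both forms of the derivative and comparing them with a numerical gradient. The sketch below is an illustration in Python, not part of the module; the function names, the test points and the step size are assumptions.

```python
def f(x):
    """f(x) = 3x(2x + 1)^8 from Example 15."""
    return 3 * x * (2 * x + 1) ** 8

def product_rule_form(x):
    """g'(x) h(x) + g(x) h'(x) with g(x) = 3x and h(x) = (2x + 1)^8."""
    return 3 * (2 * x + 1) ** 8 + 3 * x * 16 * (2 * x + 1) ** 7

def simplified_form(x):
    """The same derivative after extracting the common factor 3(2x + 1)^7."""
    return 3 * (2 * x + 1) ** 7 * (18 * x + 1)

def central_difference(func, x, h=1e-6):
    """Numerical approximation to func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (-1, 0, 2):
    # integer inputs, so the two algebraic forms agree exactly
    print(x, product_rule_form(x), simplified_form(x))

print(simplified_form(0.3), central_difference(f, 0.3))
```

Note that swapping the roles of $g$ and $h$ in `product_rule_form` would give the same value, in line with the remark that it does not matter which function is called $g$ and which $h$.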





Activity 15 Differentiating products 


Find the derivatives of each of the following functions. 


(a) z?(1— r)? (b) xe 


The chain rule and the rule for differentiating a product of two functions 
are put to good use in an example in Screencast 7.4. 


Screencast 7.4 Differentiating a power of a polynomial times 
an exponential 


This will be particularly 
important in what follows 
because likelihoods are products 
of simpler functions. 


It doesn’t matter which function 
you call g and which you call h. 


Extracting common factors in 
this way will be particularly 
helpful when finding MLEs. 





In marketing, product 
differentiation is the process of 
making your product stand out 
from the others 


3.2 Maximum likelihood estimates


In Example 8, we formed the likelihood for fitting a binomial distribution
with parameter $\theta$ to particular observations on the rolling of a biased die.
The likelihood in that case is


$L(\theta) = 120\,\theta^7 (1 - \theta)^3$.


A graph of this function is given in Figure 7, a repeat of Figure 1 but with 
its maximum clearly marked. 












Figure 7 The likelihood $L(\theta)$ for $0 < \theta < 1$ with its maximum marked


Recall that the equation of the slope, or gradient, of a function of one
variable is the derivative of that function. For example, the gradient of the
likelihood $L(\theta)$, which is a function of the variable $\theta$, is its derivative $L'(\theta)$.


Now, as you can see from Figure 7, the maximum of the function $L(\theta)$
occurs at a point at which the function is (temporarily) flat, that is, where
its slope, or gradient, is zero. At such a point (referred to in general as a
stationary point) the derivative of the curve is zero, that is, $L'(\theta) = 0$.


Figure 7 also illustrates the following important properties that are often 
possessed by a likelihood. 


Property 1. The likelihood $L(\theta)$ is a smooth function of $\theta$ (see below).
Property 2. There is exactly one stationary point. 


Property 3. The function is increasing before the stationary point — that 
is, its gradient, and therefore its derivative, is positive before the 
stationary point — and is decreasing after the stationary point — that 
is, has negative gradient, and therefore derivative, after the stationary 
point. 

Property 4. The likelihood attains its maximum at the stationary point, 
so the stationary point is the maximum likelihood estimate. 


When the likelihood has Properties 1, 2 and 3, Property 4 follows 
automatically. Thus checking Properties 1, 2 and 3 will often identify the 
MLE. Indeed, when trying to find the MLE of a single parameter, this is 
the approach that is normally tried first, and it generally works except
with models that are quite complex or are unusual in some way. 


Although ‘smoothness’ of a function can be given various mathematical 
definitions involving its continuity and the behaviour of its derivatives, we 
need not do so here. In this module, you may assume that all the 
likelihoods you encounter are smooth, and hence that Property 1 holds. 


Property 2 can be more usefully expressed in terms of the derivative of the 
likelihood function. It is equivalent to: 


Property 2*. The equation $L'(\theta) = 0$ has a single solution, say $\theta = \theta^*$.

We can therefore, for example, locate the value of $\theta^*$ for the likelihood
shown in Figure 7 by solving the equation $L'(\theta) = 0$ when
$L(\theta) = 120\,\theta^7(1 - \theta)^3$. Using Equation (9) to differentiate a product and
Equation (8) to differentiate the second term in the product, we have

$L'(\theta) = 120\{7\theta^6 \times (1 - \theta)^3 + \theta^7 \times (-1) \times 3(1 - \theta)^2\}$

$= 120\{7\theta^6(1 - \theta)^3 - 3\theta^7(1 - \theta)^2\}$

$= 120\,\theta^6(1 - \theta)^2\{7(1 - \theta) - 3\theta\}$

$= 120\,\theta^6(1 - \theta)^2(7 - 10\theta)$.

Now, since $0 < \theta < 1$, the term $120\,\theta^6(1 - \theta)^2$ is positive; call it $K(\theta)$, say.
So when we set $L'(\theta) = 0$, we have

$K(\theta)(7 - 10\theta) = 0$.

Thus the value $\theta^*$ must satisfy

$7 - 10\theta^* = 0$.

This is easily solved to yield

$\theta^* = \frac{7}{10} = 0.7$.


What of Property 3? Well, this can be checked, more or less formally, in 
any of three ways: 


• look at the graph of $L(\theta)$
• explicitly check that $L'(\theta) > 0$ for $\theta < \theta^*$ and $L'(\theta) < 0$ for $\theta > \theta^*$
• check an equivalent formulation of Property 3 in terms of the sign of
the second derivative of $L(\theta)$ (that is, the derivative of the derivative
$L'(\theta)$) at $\theta^*$.


You will not be asked to make any of these checks in this module. Indeed,
the only thing you need check is that there is exactly one solution of the
equation $L'(\theta) = 0$ for 'allowed' values of $\theta$. If so, you can take it that this
solution, $\theta^*$, is also the MLE $\hat{\theta}$. So, for example,

$\hat{\theta} = \frac{7}{10} = 0.7$

is the MLE of $\theta$ in the example of the rolling of a biased die. (This result
was mentioned in Example 8.)

Things that can go wrong with this approach outside this module include
that a single stationary point might be a minimum or another kind of
stationary point, or that the likelihood has multiple maxima.
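Although these checks are not required in the module, Property 3 is easy to examine numerically for the biased-die likelihood. The following Python sketch is illustrative only; the function names and the grid of 999 equally spaced values of $\theta$ are assumptions made for the example.

```python
def likelihood(theta):
    """L(theta) = 120 * theta^7 * (1 - theta)^3 for the biased-die data."""
    return 120 * theta ** 7 * (1 - theta) ** 3

def derivative(theta):
    """L'(theta) = 120 * theta^6 * (1 - theta)^2 * (7 - 10 * theta)."""
    return 120 * theta ** 6 * (1 - theta) ** 2 * (7 - 10 * theta)

# Allowed values strictly between 0 and 1 (grid spacing is an arbitrary choice)
grid = [i / 1000 for i in range(1, 1000)]

# Property 3: positive gradient before the stationary point, negative after it
increasing_before = all(derivative(t) > 0 for t in grid if t < 0.7)
decreasing_after = all(derivative(t) < 0 for t in grid if t > 0.7)

# Grid value at which the likelihood is largest
best = max(grid, key=likelihood)
print(increasing_before, decreasing_after, best)
```

A grid search like this only examines the chosen grid points, so it supports, rather than replaces, the algebraic argument that $7 - 10\theta$ changes sign exactly once.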
The key, therefore, to using differentiation to obtain exact values for MLEs
is Property 2*: differentiate $L(\theta)$, and find the solution of $L'(\theta) = 0$. The
following box gives the full set of steps to follow when looking for an MLE
of a single parameter $\theta$ in M248.


Finding the MLE of $\theta$


Step 1. Form the likelihood $L(\theta)$ as a product of simpler terms, as in
Equation (7).

Step 2. Differentiate $L(\theta)$ to obtain $L'(\theta)$.

Step 3. Solve the equation $L'(\theta) = 0$. If there is exactly one solution,
then set the MLE $\hat{\theta}$ equal to that solution.
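The three steps in the box can be mimicked numerically. The sketch below is a hedged illustration in Python, not the module's method: Step 2 is replaced by a central-difference approximation and Step 3 by bisection on the sign of the derivative. The data (7 successes in 10 trials) are an assumption chosen to be consistent with the likelihood $120\,\theta^7(1 - \theta)^3$, and all function names, the search interval and the tolerances are hypothetical.

```python
from math import comb

def likelihood(theta, x=7, n=10):
    """Step 1: binomial likelihood for x successes in n trials (assumed data)."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def numerical_derivative(f, theta, h=1e-6):
    """Step 2 (numerical stand-in): central-difference approximation to f'(theta)."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

def solve_stationary_point(f, lower, upper, tol=1e-6):
    """Step 3: bisection on the sign of f'.

    Assumes f' > 0 at `lower`, f' < 0 at `upper`, and exactly one sign
    change in between (Properties 2 and 3 in the text).
    """
    while upper - lower > tol:
        mid = (lower + upper) / 2
        if numerical_derivative(f, mid) > 0:
            lower = mid   # still climbing: the stationary point is to the right
        else:
            upper = mid   # descending: the stationary point is to the left
    return (lower + upper) / 2

mle = solve_stationary_point(likelihood, 0.01, 0.99)
print(mle)  # close to the exact answer 7/10
```

Solving $L'(\theta) = 0$ exactly, as in the text, gives a formula; the numerical route gives only an approximation, but it can be a useful check when the algebra is awkward.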


The above procedure is used for both discrete probability models and 
continuous probability models. Here are some examples and activities. 





Example 16 Estimating the parameter of a geometric distribution again 


In Example 9, a very small artificial dataset of three observations, $x_1 = 3$,
$x_2 = 4$ and $x_3 = 8$, was considered. Assuming that these are independent
observations from a geometric distribution with unknown parameter $\theta$,
$0 < \theta < 1$, the likelihood for $\theta$ was shown to be

$L(\theta) = (1 - \theta)^{12}\,\theta^3$.

Step 1 of finding the MLE is therefore (already) completed.


Step 2 asks us to obtain $L'(\theta)$. Using Equation (9) to differentiate a
product and Equation (8) to differentiate the first term in the product,
this is

$L'(\theta) = (-1) \times 12(1 - \theta)^{11} \times \theta^3 + (1 - \theta)^{12} \times 3\theta^2$

$= -12(1 - \theta)^{11}\theta^3 + 3(1 - \theta)^{12}\theta^2$

$= 3(1 - \theta)^{11}\theta^2\{-4\theta + (1 - \theta)\}$

$= 3(1 - \theta)^{11}\theta^2(1 - 5\theta)$.

Step 3 asks us to solve $L'(\theta) = 0$. Since for $0 < \theta < 1$ we have
$3(1 - \theta)^{11}\theta^2 > 0$, this reduces to solving the linear equation

$1 - 5\theta = 0$.

This obviously has a single solution, namely the MLE

$\hat{\theta} = \frac{1}{5} = 0.2$.





Example 17 Sparrow nests 


For each of 40 plots of land, the number of sparrow nests in that plot was 
recorded. The data are given in the following table. 
Table 8 Numbers of sparrow nests in each of 40 plots


Number of nests      0   1   2   3   >3
Observed frequency   9  22   7   2    0


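Assume that the number of nests in a plot follows a Poisson distribution,
and let $\theta$ denote the mean of the distribution; note that $\theta > 0$. We want to
obtain the MLE of $\theta$. First, we need the likelihood. The Poisson p.m.f. is

$p(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}$.

Thus the likelihood is

$L(\theta) = p(0; \theta)^9 \times p(1; \theta)^{22} \times p(2; \theta)^7 \times p(3; \theta)^2$

$= \underbrace{e^{-\theta} \times \cdots \times e^{-\theta}}_{9\text{ times}} \times \underbrace{e^{-\theta}\theta \times \cdots \times e^{-\theta}\theta}_{22\text{ times}} \times \underbrace{\frac{e^{-\theta}\theta^2}{2!} \times \cdots \times \frac{e^{-\theta}\theta^2}{2!}}_{7\text{ times}} \times \frac{e^{-\theta}\theta^3}{3!} \times \frac{e^{-\theta}\theta^3}{3!}$

$= (e^{-\theta})^9 \left(e^{-\theta}\theta\right)^{22} \left(\frac{e^{-\theta}\theta^2}{2}\right)^7 \left(\frac{e^{-\theta}\theta^3}{6}\right)^2$

$= \frac{e^{-9\theta - 22\theta - 7\theta - 2\theta}\,\theta^{22 + 14 + 6}}{2^7 \times 6^2}$

$= \frac{e^{-40\theta}\theta^{42}}{4608}$.

Using Equation (9) to differentiate a product, we have

$L'(\theta) = \frac{1}{4608}\left(-40e^{-40\theta} \times \theta^{42} + e^{-40\theta} \times 42\theta^{41}\right)$

$= \frac{1}{2304}\, e^{-40\theta}\theta^{41}(-20\theta + 21)$.

All but the linear term in brackets is irrelevant to solving $L'(\theta) = 0$
because for $\theta > 0$, $e^{-40\theta}\theta^{41}/2304 > 0$. The linear term has a single value of
$\theta$ at which it is zero, so that value is the MLE $\hat{\theta}$: $\hat{\theta}$ satisfies

$-20\hat{\theta} + 21 = 0$,

so

$\hat{\theta} = \frac{21}{20} = 1.05$.

Recall that $e^a e^b \cdots e^f = e^{a + b + \cdots + f}$ and $(e^a)^b = e^{ab}$.

$e^z > 0$ for any value of $z$.

According to a recent British Trust for Ornithology report, the sparrow
population in the UK has declined by nearly half since the 1970s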




Activity 16 Flood frequency 
Example 10 concerned the frequency of floods in a town. It was assumed
that the time between floods is an observation from an exponential
distribution with parameter $\theta > 0$. The data took values 5, 8 and 7 years.
It was shown that the likelihood for $\theta$ is

$L(\theta) = \theta^3 e^{-20\theta}$.

(a) Determine $L'(\theta)$.
(b) Hence show that the MLE of $\theta$ takes the value 0.15.


Activity 17 Industrial component inspections


Table 9 gives the frequencies of counts of the numbers of inspections of 
batches of industrial components which find no problem up to and 
including an inspection which finds one or more problems. Notice that 
counts (such as 6 or 15 inspections) which have no occurrences in the 
dataset are omitted from the table. In this dataset, n = 28. 


Table 9 Counts of industrial inspections up to and including one that 
finds a problem 





Count       1   2   3   4   5   7   9   11   13   14   17   18   26   29
Frequency   6   4   3   3   2   1   1    1    1    2    1    1    1    1


Inspecting non-industrial components coming off a production line


(Source: Bracquemond, C., Crétois, E. and Gaudoin, O., ‘A comparative study of 
goodness-of-fit for the geometric distribution and application to discrete time 
reliability’, Undated technical report) 


A good model for these data would appear to be a geometric distribution
with parameter $\theta$; here, $0 < \theta < 1$.


(a) Show that the likelihood for these data is

$L(\theta) = (1 - \theta)^{175}\,\theta^{28}$.

(b) Determine $L'(\theta)$.

(c) Hence find the MLE of $\theta$.


Exercises on Section 3 





Exercise 3 Practice with differentiation 


Find the derivatives of the following functions. 





Exercise 4 Clay pigeon shooting 


In Exercise 2, the numbers of shots that a man took to hit four clay
targets were given (3, 1, 2 and 2) and modelled by a geometric distribution
with parameter $\theta$, $0 < \theta < 1$. The likelihood for $\theta$ was shown in
Exercise 2(a) to be

$L(\theta) = (1 - \theta)^4\,\theta^4$.

(a) Determine $L'(\theta)$.
(b) Hence find the MLE of $\theta$.


Exercise 5 Tyre machine failure times 


Table 10 gives the times from repair until failure (in hours) of a machine 
applying coatings to tyres in a factory in Iraq. Here, n = 22. 


Table 10 Times until failure of machine (in hours) 


3.5 6.5 10.5 23.25 23.5 43.5 69 70 
75.5 83.25 95.5 109.5 111.25 144 164 167.25 
253 383.75 417.75 428.25 453 1215 


(Source: Al-Jammal, Z.Y. (2008) ‘Exponentiated exponential distribution as a 
failure time distribution’, Iraqi Journal of Statistical Science, vol. 14, pp. 63-75) 


A good model for these data would appear to be an exponential
distribution with parameter $\theta > 0$.


(a) Show that the likelihood for these data is

$L(\theta) = \theta^{22} e^{-4350.75\,\theta}$.

(b) Determine $L'(\theta)$.

(c) Hence find the MLE of $\theta$.





4 Maximum likelihood estimators and 
their properties 


In this section, we use differentiation to find maximum likelihood 
estimators of model parameters (not just specific maximum likelihood 
estimates). That is, we use the maximum likelihood approach to derive 
estimating formulas for parameters — maximum likelihood estimators — 
rather than just estimates for specific datasets. After doing so in 
Subsection 4.1, in Subsection 4.2 we briefly outline some attractive 
properties of maximum likelihood estimators. 


4.1 Maximum likelihood estimators 


In two examples in Subsection 3.2, we used sample data to determine the
maximum likelihood estimate of the parameter of a geometric distribution.
The first example (Example 16) was an artificial one with $n = 3$ and data
values 3, 4 and 8. For this dataset, the sample mean is
$\bar{x} = (3 + 4 + 8)/3 = 15/3 = 5$ and, from Example 16, the MLE of $\theta$ is
$\hat{\theta} = 0.2 = 1/5$. So, in that case, the MLE is the reciprocal of the sample
mean. The second example (Activity 17) concerned inspections of
industrial components; the data were given in Table 9. For this dataset,
the sample mean is

$\bar{x} = \frac{6 \times 1 + 4 \times 2 + 3 \times 3 + \cdots + 1 \times 29}{28} = \frac{203}{28}$;

also, from Activity 17, the MLE of $\theta$ is $\hat{\theta} = 28/203$ which, again, is the
reciprocal of the sample mean.


The abbreviation MLE is used to denote both maximum likelihood estimates
and maximum likelihood estimators.

MLE Pyrotechnics of Daventry: providing the fireworks in M248?

Remember that an estimator is a formula, and an estimate is its value.


Obviously, it would be helpful to know if the MLE of the parameter of a 
geometric distribution is always equal to the reciprocal of the sample 
mean. If it is, then for geometric distributions, we would not need to use 
calculus to find the MLE, but could simply set the MLE equal to the 
reciprocal of the sample mean. In Example 18, we will show that this is 
the case. Indeed, for most standard distributions there are known formulas 
for the maximum likelihood estimators of the parameters of the 
distribution. Let’s investigate what these might be. 


To recall the definition of a likelihood, suppose that $X_1, X_2, \ldots, X_n$ is a
random sample that takes values $x_1, x_2, \ldots, x_n$. If these values are from a
discrete distribution with probability mass function $p(x; \theta)$, then the
likelihood of $\theta$ is

$L(\theta) = p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta)$.

Similarly, if they are from a continuous distribution with probability
density function $f(x; \theta)$, then

$L(\theta) = f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_n; \theta)$.


To obtain the maximum likelihood estimator, we first determine the
maximum likelihood estimate. This will be a function of the observations
$x_1, x_2, \ldots, x_n$. Then the maximum likelihood estimator is obtained simply
by replacing observed quantities by the corresponding random variables:
the observation $x_1$ would be replaced by $X_1$, $x_2$ by $X_2$, and so forth.


The approach is illustrated in the next examples and activities. 


Example 18 Maximum likelihood estimator for a geometric distribution 
Suppose that $x_1, x_2, \ldots, x_n$ make up a random sample of observations from
a geometric distribution with parameter $\theta$, $0 < \theta < 1$. The likelihood is

$L(\theta) = p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta)$

$= (1 - \theta)^{x_1 - 1}\theta \times (1 - \theta)^{x_2 - 1}\theta \times \cdots \times (1 - \theta)^{x_n - 1}\theta$

$= (1 - \theta)^{\sum_{i=1}^{n} x_i - n}\,\theta^n$.

Now, $\sum_{i=1}^{n} x_i / n = \bar{x}$, so $\sum_{i=1}^{n} x_i = n\bar{x}$. It follows that the likelihood can
be written in a slightly simpler form based on the sample mean:

$L(\theta) = (1 - \theta)^{n\bar{x} - n}\,\theta^n$.


To obtain $L'(\theta)$, we use Equation (9) to differentiate a product and
Equation (8) to differentiate the first term in the product:

$L'(\theta) = (-1) \times (n\bar{x} - n)(1 - \theta)^{n\bar{x} - n - 1} \times \theta^n + (1 - \theta)^{n\bar{x} - n} \times n\theta^{n - 1}$

$= n(1 - \theta)^{n\bar{x} - n - 1}\theta^{n - 1}\{-(\bar{x} - 1)\theta + (1 - \theta)\}$

$= n(1 - \theta)^{n\bar{x} - n - 1}\theta^{n - 1}(1 - \bar{x}\theta)$.


Now, since $0 < \theta < 1$, the multiplier $n(1 - \theta)^{n\bar{x} - n - 1}\theta^{n - 1} > 0$, so the only
solution of $L'(\theta) = 0$ is when

$1 - \bar{x}\theta = 0$.

The maximum likelihood estimate for this set of data is therefore

$\hat{\theta} = \frac{1}{\bar{x}}$.


So, replacing the observed sample mean by its random variable version, the
maximum likelihood estimator of the parameter $\theta$ of a geometric
distribution for any set of data is therefore given by

$\hat{\theta} = \frac{1}{\bar{X}}$.

The MLE of the geometric parameter is indeed always equal to the
reciprocal of the sample mean.
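The estimator from Example 18 is simple to apply in code. The following Python sketch is an illustration rather than part of the module; the function name `geometric_mle` is hypothetical, exact fractions are used so that the reciprocal of the sample mean is not blurred by rounding, and the Table 9 frequencies are entered on the assumption that they total 28 observations with sum 203.

```python
from fractions import Fraction

def geometric_mle(observations):
    """MLE of the geometric parameter: 1 / (sample mean) = n / (sum of observations)."""
    return Fraction(len(observations), sum(observations))

# Example 16: three artificial observations
example_16 = [3, 4, 8]

# Table 9: count -> frequency (assumed reading of the table; n = 28, total = 203)
table_9 = {1: 6, 2: 4, 3: 3, 4: 3, 5: 2, 7: 1, 9: 1, 11: 1,
           13: 1, 14: 2, 17: 1, 18: 1, 26: 1, 29: 1}
inspections = [count for count, freq in table_9.items() for _ in range(freq)]

print(geometric_mle(example_16))    # 1/5
print(geometric_mle(inspections))   # 28/203 in lowest terms, i.e. 4/29
```

Feeding the data into the formula in this way avoids repeating the differentiation for each new dataset, which is the practical benefit of having an estimator rather than a one-off estimate.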





Example 19 Maximum likelihood estimator for a Poisson distribution 


Suppose that $x_1, x_2, \ldots, x_n$ is a random sample of observations from a
Poisson distribution with parameter $\theta$, $\theta > 0$. To obtain the maximum
likelihood estimate of $\theta$, we first form the likelihood:

$L(\theta) = p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta)$

$= \frac{e^{-\theta}\theta^{x_1}}{x_1!} \times \frac{e^{-\theta}\theta^{x_2}}{x_2!} \times \cdots \times \frac{e^{-\theta}\theta^{x_n}}{x_n!}$

$= \frac{e^{-n\theta}\,\theta^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$.

Notice that $1/\prod_{i=1}^{n} x_i!$ is positive and does not depend on $\theta$, so can be
written as a positive constant, $C$, say. Also, using the fact that
$\sum_{i=1}^{n} x_i = n\bar{x}$, the likelihood can then be written more simply as

$L(\theta) = C e^{-n\theta}\theta^{n\bar{x}}$.

It follows that, differentiating the product (using Equation (9)),

$L'(\theta) = C\left(-n e^{-n\theta} \times \theta^{n\bar{x}} + e^{-n\theta} \times n\bar{x}\,\theta^{n\bar{x} - 1}\right)$

$= C n e^{-n\theta}\theta^{n\bar{x} - 1}(-\theta + \bar{x})$.


Since $\theta$ is positive, so is $C n e^{-n\theta}\theta^{n\bar{x} - 1}$, and the only solution of $L'(\theta) = 0$ is
when

$-\theta + \bar{x} = 0$,

namely

$\hat{\theta} = \bar{x}$.


Again, replacing the observed sample mean by its random variable version,
the maximum likelihood estimator of the parameter $\theta$ of a Poisson
distribution for any set of data is therefore given by

$\hat{\theta} = \bar{X}$.


That is, the MLE of the Poisson parameter is always equal to the sample
mean.


In Example 17, we saw that the maximum likelihood estimate of $\theta$ for the
data on sparrow nests given in Table 8, assuming a Poisson distribution, is
$\hat{\theta} = 21/20 = 1.05$. Instead of deriving this MLE from scratch, had we
known what we now know, we could have declared the MLE to be the
sample mean and just calculated the latter. From Table 8,

$\bar{x} = \frac{9 \times 0 + 22 \times 1 + 7 \times 2 + 2 \times 3}{40} = \frac{42}{40} = \frac{21}{20} = 1.05$.
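The shortcut described above, computing the Poisson MLE as the sample mean of a frequency table, can be sketched in a few lines of Python. This is an illustration only; the function name `mean_from_frequencies` is hypothetical, and the frequencies are those of Table 8.

```python
from fractions import Fraction

def mean_from_frequencies(freq_table):
    """Sample mean of data given as {value: frequency}, returned as an exact fraction."""
    total = sum(value * freq for value, freq in freq_table.items())
    n = sum(freq_table.values())
    return Fraction(total, n)

# Table 8: number of nests -> observed frequency
nests = {0: 9, 1: 22, 2: 7, 3: 2}

poisson_mle = mean_from_frequencies(nests)
print(poisson_mle, float(poisson_mle))   # 21/20 and 1.05
```

The same helper could be reused for any model in which the MLE is a function of the sample mean, such as the geometric distribution, where the estimate is the reciprocal of the value returned.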





You can work out the general form of the MLE for binomial and 
exponential parameters for yourself in Activities 18 and 19, respectively. 


Activity 18 Maximum likelihood estimator for a binomial 
distribution 


Suppose that $n$ independent trials are conducted and the number of
successes, $X$, is a random variable that follows the binomial distribution
$B(n, \theta)$, $0 < \theta < 1$.


(a) What is the likelihood for $\theta$ based on the single observation $X = x$?
(b) Determine the maximum likelihood estimate of $\theta$ when $X = x$.


(c) Hence write down the maximum likelihood estimator of $\theta$.


Activity 19 Maximum likelihood estimator for an exponential 
distribution 


Suppose that $x_1, x_2, \ldots, x_n$ is a random sample of observations from an
exponential distribution with parameter $\theta$, $\theta > 0$, these being the observed
values of independent random variables $X_1, X_2, \ldots, X_n$ from this
distribution.


(a) What is the likelihood for $\theta$?
(b) Determine the maximum likelihood estimate of $\theta$.


(c) Hence write down the maximum likelihood estimator of $\theta$. How does
the maximum likelihood estimator depend on the sample mean?


So we have found, for a number of standard distributions, formulas for the 
maximum likelihood estimators of the parameters of the distribution. 
When sample data come from such a distribution, applying these formulas 
is the quickest and easiest way of obtaining a maximum likelihood estimate 
— just feed sample values into the formula for the estimator to obtain the 
estimate. 




Table 11 contains a list of standard results for maximum likelihood
estimators for the parameters (in their more usual notation, instead of
using $\theta$ everywhere) of some of the more well-known probability models.
In all but one case, these estimators assume a random sample
$X_1, X_2, \ldots, X_n$ with sample mean $\bar{X}$. The exception is the binomial
distribution, whose estimator is based on a single observation $X$. The
latter could have been made to match the others by thinking of $X$ as being
derived (as the total number of successes) from an underlying random
sample of $n$ observations from a Bernoulli distribution each with
parameter $p$ ($n$ independent Bernoulli trials). For better or worse, Table 11
follows the most standard way of presenting these results in statistics
textbooks.


The table also states whether or not the estimator is unbiased, that is, 
whether or not its average value is equal to the parameter it is estimating. 
You need not worry about proving all the results about bias; they are 
included for information only. 


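Table 11 Maximum likelihood estimators (MLEs) for some
standard probability models


Probability distribution     Estimator                                          Properties


Binomial, $B(n, p)$          $\hat{p} = X/n$                                    $E(\hat{p}) = p$
Geometric, $G(p)$            $\hat{p} = 1/\bar{X}$                              $\hat{p}$ is biased
Poisson($\lambda$)           $\hat{\lambda} = \bar{X}$                          $E(\hat{\lambda}) = \lambda$
Exponential, $M(\lambda)$    $\hat{\lambda} = 1/\bar{X}$                        $\hat{\lambda}$ is biased
Normal, $N(\mu, \sigma^2)$   $\hat{\mu} = \bar{X}$                              $E(\hat{\mu}) = \mu$
                             $\hat{\sigma}^2 = \sum (X_i - \bar{X})^2 / n$      $\hat{\sigma}^2$ is biased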


The first four MLEs listed in Table 11 were derived in Examples 18 and 19
and Activities 18 and 19. The MLEs of the parameters of the normal
distribution will not be derived here, partly because we have not been
dealing with maximum likelihood estimation of two parameters at once in
this unit. They are, however, important to recognise; for the normal
distribution
• the MLE of $\mu$ is the sample mean, $\bar{X}$
• the MLE of $\sigma^2$ is the estimator

$\hat{\sigma}^2 = W = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$

(and not the unbiased estimator $S^2$, which has divisor $n - 1$ in place
of $n$).

And this is the DJ MLE
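The difference between the MLE $W$ (divisor $n$) and the unbiased estimator $S^2$ (divisor $n - 1$) can be seen with a small calculation. The Python sketch below is illustrative only: the sample values are hypothetical, and it relies on the standard-library functions `statistics.pvariance` (divisor $n$) and `statistics.variance` (divisor $n - 1$).

```python
from statistics import mean, pvariance, variance

def normal_mles(sample):
    """Return (mu_hat, sigma2_hat): the sample mean and W = sum((x - mean)^2) / n."""
    return mean(sample), pvariance(sample)

sample = [6.1, 7.3, 6.8, 7.0, 6.5]   # hypothetical measurements
n = len(sample)

mu_hat, w = normal_mles(sample)
s2 = variance(sample)                # unbiased estimator, divisor n - 1

# W and S^2 are linked by W = (n - 1) / n * S^2, so W is always the smaller
print(mu_hat, w, s2, (n - 1) / n * s2)
```

The identity $W = \frac{n-1}{n} S^2$ means that the MLE of $\sigma^2$ can be recovered from a reported sample variance and sample size without access to the raw data.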


You should be reassured to observe that in almost all of the estimation
problems connected with standard models summarised above, the method
of maximum likelihood has led back to estimators that are identical to the
estimators that we might have come up with on intuitive grounds. In fact,
the estimators $X/n$ for the binomial parameter and $\bar{X}$ for both Poisson
and normal means were all considered in Section 1 of this unit, before we
embarked on our odyssey into maximum likelihood estimation. Also, the
estimators $1/\bar{X}$ were previously suggested in Subsection 1.2 of Unit 4 for
the geometric parameter and in Subsection 2.2 of Unit 5 for the
exponential parameter.


The following activity requires you to use results from Table 11 to obtain 
maximum likelihood estimates. 


Activity 20 Maximum likelihood estimates for data from standard 
probability models 


Given each of the following samples, use Table 11 to determine the 
maximum likelihood estimate of the unknown parameter. 


(a) Observations 5, 3, 19, 9, 4, 7, 8, 3, 15, 5, 7, 4 from the geometric 
distribution, G(p). 


(b) Observations given in Table 1, and considered in Example 1, on counts 
of the leech Helobdella in 103 water samples, assumed to come from 
the Poisson distribution, Poisson(A). Hint: no new calculations are 
needed, so you can reuse a result that was obtained earlier in the unit. 


(c) Observations 0.131, 2.58, 0.04, 4.64, 1.70, 0.40, 4.28, 0.19 from the 
exponential distribution, M (A). 


Activity 21 Maximum likelihood estimates for data from the normal 
distribution 


(a) In Example 21 of Unit 6, a small sample of $n = 9$ Byzantine coins
whose silver content had been measured was considered. (These were
the coins from the first of four coinages.) On consideration of a normal
probability plot, it was thought plausible that the silver contents could
be modelled by a normal, $N(\mu, \sigma^2)$, distribution. It turns out that the
sample mean of these data is $\bar{x} \approx 6.7444\%$ and the sample variance is
$s^2 \approx 0.2953\,\%^2$.


(i) What is the maximum likelihood estimate of the parameter $\mu$?


(ii) What is the maximum likelihood estimate of the parameter $\sigma^2$?
Hint: you will have to work this out from the values of $n$ and $s^2$.


(b) In Example 3, we revisited data on the chest circumferences of
$n = 5732$ nineteenth-century Scottish soldiers (in inches). A normal
distribution, $N(\mu, \sigma^2)$, was deemed to be an appropriate model for
these data. The sample mean is $\bar{x} \approx 39.8489$ inches and the sample
variance is $s^2 \approx 4.2989$ inches$^2$.


(i) What is the maximum likelihood estimate of the parameter $\mu$?
(ii) What is the maximum likelihood estimate of the parameter $\sigma^2$?


(c) Comment on the similarity or otherwise between the values of $s^2$ and
the MLE of $\sigma^2$ for these two datasets.




4.2 Properties of maximum likelihood estimators 


In Subsection 1.2, the question was asked: ‘What makes a good 
estimator?’ It was suggested that an estimator is useful if it has low bias, 
indeed preferably no bias at all — that is, it is unbiased — and has a small 
variance. For most estimation methods, it is impossible to make general 
statements about the properties of the estimators they yield; the 
properties will vary with the underlying probability model. If the sample 
size is small, the same is true of maximum likelihood estimators. For large 
sample sizes, however, maximum likelihood estimators possess certain good 
properties. Results are said to hold asymptotically if they are 
approximately true provided that the sample size is large enough. 


The statistical theory behind the results in the following box is difficult, 
mathematically, and details will not be given here. You should accept 
these claims on trust. 


Properties of maximum likelihood estimators 


• Maximum likelihood estimators are sometimes unbiased and
typically have small bias. Also, they are asymptotically
unbiased; that is,

$E(\hat{\theta}) \to \theta$ as $n \to \infty$,

where $n$ is the sample size. It follows that maximum likelihood
estimators are approximately unbiased for large sample sizes.


• In addition, for maximum likelihood estimators, the variance
$V(\hat{\theta})$ tends to 0 with increasing sample size:

$V(\hat{\theta}) \to 0$ as $n \to \infty$.

Moreover, for large $n$, no unbiased estimator of $\theta$ has a smaller
variance than the maximum likelihood estimator.


So maximum likelihood estimators possess the sorts of useful properties 
identified in Subsection 1.2. For instance, if they are not (exactly) 
unbiased for 0, then for large samples they are at least approximately 
unbiased for 0, and their variance becomes small. 





Example 20 Asymptotic unbiasedness of the MLE of a normal variance 


To illustrate how an estimator's bias can decline with increasing sample
size, consider the MLE of the variance ($\sigma^2$) of a normal distribution. From
Table 11, the MLE is $\hat{\sigma}^2 = W = \sum (X_i - \bar{X})^2 / n$. In Subsection 1.2 of this
unit (Activity 5) you showed that

$E(\hat{\sigma}^2) = \frac{n - 1}{n}\,\sigma^2 = \sigma^2 - \frac{1}{n}\,\sigma^2$.


The Central Limit Theorem of 


Unit 6 is an example of an 
asymptotic result. 


As before, the symbol '$\to$' is
read as 'tends to'.

This is consistent with the
observation that $\hat{\sigma}^2 = W$ differs
little from the unbiased
estimator $S^2$ in an example with
large $n$ in Activity 21(b).


The Central Limit Theorem is 
very relevant! 


The log-likelihood can also avoid 
computational problems with 
the likelihood, associated with 
values of the likelihood 
sometimes being extremely 
small. 


Note that $m_2 \ne \bar{x}^2$, $M_2 \ne \bar{X}^2$.

Hence the MLE of $\sigma^2$ is biased, with a bias of $-\sigma^2/n$. Now, as $n$ increases,
$1/n$ decreases, so the bias also gets smaller in size. Indeed, $1/n \to 0$ as $n \to \infty$, so

$-\frac{1}{n}\,\sigma^2 \to 0$ as $n \to \infty$,

and hence

$E(\hat{\sigma}^2) \to \sigma^2$ as $n \to \infty$.


Thus the MLE of $\sigma^2$ is asymptotically unbiased, even though it has some
(small) bias for finite sample sizes.
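The shrinking bias in Example 20 can be illustrated by simulation. The Python sketch below is an assumption-laden illustration, not a proof: it draws repeated normal samples with $\sigma^2 = 4$, averages the MLE $W$ over the replications, and compares the average with the theoretical value $\frac{n-1}{n}\sigma^2$. The function names, the seed, the number of replications and the sample sizes are all arbitrary choices.

```python
import random

def mle_variance(sample):
    """W = sum((x - mean)^2) / n, the MLE of a normal variance."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def average_mle_variance(n, sigma=2.0, replications=5000, seed=1):
    """Average of W over many simulated N(0, sigma^2) samples of size n."""
    rng = random.Random(seed)   # fixed seed so that the run is repeatable
    total = 0.0
    for _ in range(replications):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += mle_variance(sample)
    return total / replications

sigma2 = 4.0
for n in (2, 5, 50):
    # simulated average of W alongside the theoretical mean (n - 1) / n * sigma^2
    print(n, average_mle_variance(n), (n - 1) / n * sigma2)
```

Because the averages are themselves estimates, they will not match the theoretical values exactly, but the gap between the average of $W$ and $\sigma^2 = 4$ should narrow as $n$ grows.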





It is also possible to state useful conclusions not just about the mean and 
variance of maximum likelihood estimators, but about their sampling 
distribution as well. Most importantly, maximum likelihood estimators are 
asymptotically normally distributed, provided that certain mild conditions 
apply. (The conditions are satisfied by most probability models.) However, 
this kind of result requires a certain amount of supporting theory before it 
can be confidently applied, and it will not be pursued further in this 
module. 


In other texts or modules, you will often see maximum likelihood
estimation approached through $\ell(\theta) = \log\{L(\theta)\}$, the so-called
log-likelihood. This is very much a valid approach that can make the
derivation of MLEs somewhat simpler, provided that you are comfortable
with the rules for manipulating and differentiating logarithms.
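One computational reason for working with the log-likelihood is that a likelihood built from many observations can be too small for a computer to represent. The Python sketch below illustrates this for a Poisson model; it is not part of the module, the dataset of 1000 identical observations is hypothetical, and the function names are assumptions. It uses the fact that $\log(x!)$ equals `math.lgamma(x + 1)`.

```python
from math import exp, factorial, lgamma, log

def poisson_likelihood(theta, data):
    """Likelihood as a plain product of Poisson p.m.f. values."""
    value = 1.0
    for x in data:
        value *= exp(-theta) * theta ** x / factorial(x)
    return value

def poisson_log_likelihood(theta, data):
    """Log-likelihood: sum of -theta + x*log(theta) - log(x!) over the data."""
    return sum(-theta + x * log(theta) - lgamma(x + 1) for x in data)

data = [3] * 1000   # hypothetical sample with sample mean 3

# The product of 1000 probabilities underflows, so it cannot be used to compare values of theta
print(poisson_likelihood(3.0, data))
# The log-likelihood stays finite and is largest at the sample mean
for theta in (2.9, 3.0, 3.1):
    print(theta, poisson_log_likelihood(theta, data))
```

Since the logarithm is an increasing function, the value of $\theta$ that maximises $\ell(\theta)$ also maximises $L(\theta)$, so nothing is lost by switching to the log scale.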


It has been shown that the method of maximum likelihood can often lead 
to estimators that are identical to the common sense estimators that we 
might guess without any supporting theory. A benefit of deriving an 
estimator by maximum likelihood is that it is then known to possess the 
above desirable properties. 


Exercises on Section 4 





Exercise 6 Maximum likelihood estimator for a Rayleigh distribution 


In Activity 10, the Rayleigh distribution was introduced. Its p.d.f. is

$f(x) = \theta x\, e^{-\theta x^2/2}$, $x > 0$,

with $\theta > 0$. Let us suppose that we have available a random sample
$x_1, x_2, \ldots, x_n$, these being the observed values of independent random
variables $X_1, X_2, \ldots, X_n$ from this distribution. Set

$C = \prod_{i=1}^{n} x_i$, $m_2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2$ and $M_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2$.

(a) Show that the likelihood for $\theta$ can be written

$L(\theta) = C\,\theta^n e^{-n m_2 \theta/2}$.


(b) Determine the maximum likelihood estimate of $\theta$.


(c) Hence write down the maximum likelihood estimator of $\theta$.





Exercise 7 Douglas firs 


The ecologist E.C. Pielou was interested in the pattern of healthy and 
diseased trees in a plantation of Douglas firs. (The disease that was the 
subject of her research was ‘Armillaria root rot’.) Several narrow lines of 
trees through the plantation (called ‘transects’) were examined. After each 
diseased tree, X, the number of trees that had to be examined in order to 
find a healthy tree was counted. There were 109 such counts. Their 
frequency distribution is given in Table 12. 


Table 12 Numbers of trees examined to find a healthy tree 


Number of trees, X 1 2 3 4 5 6 
Frequency 71 28 5 2 2 1 


(Source: Pielou, E.C. (1963) ‘Runs of healthy and diseased trees in transects 
through an infected forest’, Biometrics, vol. 19, no. 4, pp. 603-14) 


(a) Assuming that X has a geometric distribution with parameter p, use 
information given in Table 11 to determine the maximum likelihood 
estimate of p. 





(b) In fact, information from a total of 166 trees is summarised in 
Table 12. Of these trees, 109 were healthy and 57 were unhealthy. 
Let Y be a random variable representing the number of healthy trees in 
a collection of 166 trees. Suppose that Y ~ B(166,p). Use information 
given in Table 11 to write down the maximum likelihood estimate of p. 


Douglas firs in Canada 


(c) Say whether each of the estimators in parts (a) and (b) is biased. Does 
this result surprise you? 





Summary 


There are various ways of obtaining estimating formulas — that is, 
estimators — of unknown model parameters; when an estimating formula is 
applied in a data context, the resulting number provides an estimate of the 
unknown parameter. The quality of an estimate can be assessed from the 
properties of the sampling distribution of the corresponding estimator. Not 
all estimating procedures are applicable in all data contexts, and not all 
estimating procedures are guaranteed always to give sensible estimates. 
Often, though, reasonable estimation methods give estimates that are 
either identical or very similar to one another. Qualities in an estimator 
that are desirable include unbiasedness and small variance. 
One particular estimation method has been discussed in detail in this unit:
the method of maximum likelihood. Maximum likelihood is one of the
most useful estimation methods available because of its versatility and 
because it leads to estimators that have good properties. In particular, 
maximum likelihood estimators are both asymptotically unbiased and have 
variance tending to zero. The maximum likelihood estimate is the value of 
the parameter that is most likely to give rise to the observed data as 
measured by the likelihood function, or just likelihood. In many situations, 
the MLE can be obtained by taking the derivative of the likelihood and 
equating the derivative to zero. For most standard distributions, there are 
known formulas for the maximum likelihood estimators of the 
distribution’s parameters. 


Learning outcomes 


After you have worked through this unit, you should be able to: 
• realise that a parameter can have a variety of plausible estimators
• compare estimators in simple contexts by examining their means and
  variances
• understand the notion of unbiasedness and that it is a desirable
  quality in an estimator
• appreciate that the sample variance is an unbiased estimator of the
  variance of any (not necessarily normal) distribution
• acknowledge that if two estimators are both unbiased, then the one
  with the smaller variance is usually preferred, but also that estimators
  with both small bias and small variance can be useful
• appreciate that the method of maximum likelihood is an important
  way of estimating a parameter
• construct the likelihood $L(\theta)$ associated with random samples from
  simple models
• determine maximum likelihood estimates and estimators in
  well-behaved one-parameter problems through differentiation; this
  involves obtaining $L'(\theta)$ and choosing the MLE $\hat{\theta}$ to solve the
  equation $L'(\hat{\theta}) = 0$
• use standard results to obtain maximum likelihood estimates for
  common probability models
• recognise that desirable qualities in estimators such as asymptotic
  unbiasedness and variance tending to zero are possessed by maximum
  likelihood estimators.


Solutions to activities 


Solution to Activity 1 


When the population is Poisson($\lambda$), the variance $\sigma^2$ is equal to the
mean $\mu$, and both are equal to $\lambda$. So if a random variable $\overline{W}$ was based on
a dataset of size $n$ from the Poisson distribution with parameter $\lambda$, then

$$E(\overline{W}) = \lambda, \qquad V(\overline{W}) = \frac{\lambda}{n}.$$

Now, the random variable $\overline{X}$ was based on a dataset of size 103, the
random variable $\overline{Y}$ on a dataset of size 48. So

$$E(\overline{X}) = E(\overline{Y}) = \lambda,$$

but

$$V(\overline{X}) = \frac{\lambda}{103}, \qquad V(\overline{Y}) = \frac{\lambda}{48}.$$

The former is smaller than the latter. This is a particular example of a
phenomenon you saw in Unit 6: the larger the sample that is taken, the
smaller is the variance of the sample mean.


Solution to Activity 2 

For the binomial distribution, $E(X) = np$. Thus

$$E\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{np}{n} = p.$$

Therefore $X/n$ is an unbiased estimator of $p$.


Solution to Activity 3 


(a) For any probability model, the sample mean is an unbiased estimator
of the population mean. So the sample mean is an unbiased estimator
of $\mu$.
The variance of the sample mean is $\sigma^2/n$ (Subsection 6.1 of Unit 6),
which is 25/12 in this case.

(b) As $n$ increases, $\sigma^2/n$ decreases, so increasing the sample size
would reduce the variance of the sample mean and make it a better
estimator of the population mean.


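Solution to Activity 4

(a) $$\begin{aligned}
E(\hat{\mu}_2) &= E\left\{\tfrac{1}{11}(6X_1 + 3X_2 + 2X_3)\right\} \\
&= \tfrac{1}{11}\,E(6X_1 + 3X_2 + 2X_3) \\
&= \tfrac{1}{11}\{E(6X_1) + E(3X_2) + E(2X_3)\} \\
&= \tfrac{1}{11}\{6E(X_1) + 3E(X_2) + 2E(X_3)\} \\
&= \tfrac{1}{11}(6\mu + 3\mu + 2\mu) = \mu.
\end{aligned}$$
Hence $\hat{\mu}_2$ is an unbiased estimator of $\mu$.

(b) $$\begin{aligned}
V(\hat{\mu}_2) &= V\left\{\tfrac{1}{11}(6X_1 + 3X_2 + 2X_3)\right\} \\
&= \left(\tfrac{1}{11}\right)^2 V(6X_1 + 3X_2 + 2X_3) \\
&= \left(\tfrac{1}{11}\right)^2 \{V(6X_1) + V(3X_2) + V(2X_3)\} \\
&= \left(\tfrac{1}{11}\right)^2 \{6^2\,V(X_1) + 3^2\,V(X_2) + 2^2\,V(X_3)\} \\
&= \tfrac{1}{121}(36 \times 1 + 9 \times 4 + 4 \times 9) = \tfrac{108}{121} \approx 0.89.
\end{aligned}$$
This is greater than $V(\hat{\mu}_1) \approx 0.73$.

(c) The estimator $\hat{\mu}_1$ is preferred to $\hat{\mu}_2$ because, while both are unbiased
estimators of $\mu$, the variance of $\hat{\mu}_1$ is smaller than the variance of $\hat{\mu}_2$.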
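
The arithmetic in parts (a) and (b) can be checked with exact fractions. The sketch below assumes, as the working above implies, independent $X_1, X_2, X_3$ with common mean $\mu$ and variances 1, 4 and 9; the function names are illustrative only.

```python
from fractions import Fraction

def combo_mean_coeff(weights):
    """Coefficient of mu in E(sum w_i X_i) when every X_i has mean mu."""
    return sum(weights)

def combo_variance(weights, variances):
    """Variance of sum w_i X_i for independent X_i with the given variances."""
    return sum(w * w * v for w, v in zip(weights, variances))

# Assumed from the solution: mu2_hat = (6 X1 + 3 X2 + 2 X3) / 11,
# with V(X1) = 1, V(X2) = 4, V(X3) = 9.
weights = [Fraction(6, 11), Fraction(3, 11), Fraction(2, 11)]
variances = [1, 4, 9]

mean_coeff = combo_mean_coeff(weights)         # equals 1 when the estimator is unbiased
var_mu2 = combo_variance(weights, variances)   # exact variance as a fraction
print(mean_coeff, var_mu2, float(var_mu2))
```

A coefficient of 1 on $\mu$ corresponds to unbiasedness, and the exact variance can be compared directly with the rounded value quoted in part (b).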


Solution to Activity 5 


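(a) The expectation of $W$ is

$$E(W) = \frac{1}{n} \times (n-1)\sigma^2 \neq \sigma^2,$$

so $W$ is a biased estimator of $\sigma^2$.

(b) The bias of $W$ is

$$E(W) - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = \frac{n-1-n}{n}\sigma^2 = -\frac{1}{n}\sigma^2.$$

The estimator $W$ is therefore negatively biased, meaning that on
average, $W$ tends to underestimate the value of $\sigma^2$.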
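
The bias formula can be verified exactly for a small discrete population by enumerating every possible sample. This is only a sketch: the Bernoulli(3/10) population and the sample size 4 are hypothetical choices made for illustration, and `expected_w` is not a name used elsewhere in the unit.

```python
from fractions import Fraction
from itertools import product

def expected_w(values, probs, n):
    """Exact E(W), where W = (1/n) * sum (x_i - xbar)^2, over all samples of size n."""
    total = Fraction(0)
    for sample in product(range(len(values)), repeat=n):
        prob = Fraction(1)
        for i in sample:
            prob *= probs[i]              # independent draws: multiply probabilities
        xs = [values[i] for i in sample]
        xbar = Fraction(sum(xs), n)
        w = sum((x - xbar) ** 2 for x in xs) / n
        total += prob * w
    return total

# Hypothetical population: Bernoulli(3/10), so sigma^2 = (3/10)(7/10).
values = [0, 1]
probs = [Fraction(7, 10), Fraction(3, 10)]
n = 4
sigma2 = Fraction(3, 10) * Fraction(7, 10)

e_w = expected_w(values, probs, n)
bias = e_w - sigma2
print(e_w, bias)
```

Because the enumeration is exact, the computed bias should agree with $-\sigma^2/n$ with no simulation error.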


Solution to Activity 6 


(a) The number of mice with adenomas in a sample of size 54 has a
binomial distribution $B(54, \theta)$. Given that six mice in the sample had
adenomas, the likelihood of $\theta$ is

$$L(\theta) = p(6; \theta) = \binom{54}{6}\theta^6(1-\theta)^{48} = 25\,827\,165\;\theta^6(1-\theta)^{48}.$$

(b) $$L(0.11) = 25\,827\,165\,(0.11)^6(0.89)^{48} \approx 0.1703,$$
$$L(0.12) = 25\,827\,165\,(0.12)^6(0.88)^{48} \approx 0.1669.$$


These give the following table.

Table 13

$\theta$       0.09     0.10     0.11     0.12     0.13
$L(\theta)$    0.1484   0.1643   0.1703   0.1669   0.1558


(c) A graph of $L(\theta)$ is shown in Figure 8.

Figure 8 A graph of $L(\theta)$ for $0.09 \le \theta \le 0.13$

(d) From the position of the peak of the curve, $\hat{\theta}$ is a little greater than
0.11, but much smaller than 0.12. So $\hat{\theta} \approx 0.11$. (The exact value of $\hat{\theta}$
is, in fact, $\tfrac{1}{9} \approx 0.111$.)
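
The table and the estimate in this activity can be reproduced numerically. The following is a minimal sketch using only the standard library; the function name and the choice of grid are illustrative, and the exact maximiser $x/n = 6/54$ is taken from the standard binomial result.

```python
from math import comb

def binomial_likelihood(theta, x=6, n=54):
    """Likelihood of theta for x successes observed in n Bernoulli trials."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

grid = [0.09, 0.10, 0.11, 0.12, 0.13]
table = {t: binomial_likelihood(t) for t in grid}

best_on_grid = max(table, key=table.get)   # grid value with the largest likelihood
exact_mle = 6 / 54                         # x / n

for t in grid:
    print(f"{t:.2f}  {table[t]:.4f}")
print(best_on_grid, round(exact_mle, 3))
```

Evaluating the likelihood on a finer grid around 0.11 would narrow down the peak in the same way as reading it off the graph.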
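
Solution to Activity 7

(a) The likelihood for this random sample is given by
$$\begin{aligned}
L(\theta) &= p(x_1; \theta) \times p(x_2; \theta) \times p(x_3; \theta) \times p(x_4; \theta) \\
&= (1-\theta)^{x_1-1}\theta \times (1-\theta)^{x_2-1}\theta \times (1-\theta)^{x_3-1}\theta \times (1-\theta)^{x_4-1}\theta \\
&= (1-\theta)^{x_1+x_2+x_3+x_4-4}\,\theta^4 \\
&= (1-\theta)^3\theta^4,
\end{aligned}$$
as required.

(b) From Figure 3, the likelihood appears to be maximised when $\theta$ is just
below 0.6. That is, $\hat{\theta} \approx 0.58$, say. (The exact MLE turns out to be
$\hat{\theta} \approx 0.571$ in this case.)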

Solution to Activity 8

(a) Each of the 187 instances of unvariegated and unfaded (denoted 0)
offspring plants contributes $p(0; \theta)$ to the likelihood, resulting in
$p(0; \theta)^{187}$. Similarly, each of the 37 instances of variegated and
unfaded (v) offspring plants contributes $p(v; \theta)$ to the likelihood,
resulting in $p(v; \theta)^{37}$. And so on. The complete likelihood of $\theta$ for the
sample observed is therefore given by

$$\begin{aligned}
L(\theta) &= p(0; \theta)^{187} \times p(v; \theta)^{37} \times p(f; \theta)^{35} \times p(vf; \theta)^{31} \\
&= \left(\tfrac{9}{16} + \theta\right)^{187} \left(\tfrac{3}{16} - \theta\right)^{37} \left(\tfrac{3}{16} - \theta\right)^{35} \left(\tfrac{1}{16} + \theta\right)^{31} \\
&= \left(\tfrac{9}{16} + \theta\right)^{187} \left(\tfrac{3}{16} - \theta\right)^{72} \left(\tfrac{1}{16} + \theta\right)^{31}.
\end{aligned}$$

(b) From Figure 4, $L(\theta)$ is maximised when $\theta$ is somewhere above 0.05;
whichever value you chose corresponding to this is the approximate
MLE of $\theta$. (A more precise estimate turns out to be $\hat{\theta} \approx 0.0584$.)
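
The more precise estimate quoted in part (b) can be found numerically. The sketch below assumes that the four category probabilities are $\tfrac{9}{16}+\theta$, $\tfrac{3}{16}-\theta$, $\tfrac{3}{16}-\theta$ and $\tfrac{1}{16}+\theta$ (the form consistent with the quoted estimate), works with the log-likelihood because it has the same maximiser as the likelihood, and locates the zero of its derivative by bisection. The function names and the bracketing interval are illustrative choices.

```python
from math import log

# Observed counts: ordinary, variegated, faded, variegated-and-faded.
COUNTS = {"o": 187, "v": 37, "f": 35, "vf": 31}

def log_likelihood(theta):
    """Log-likelihood under the assumed category probabilities."""
    return (COUNTS["o"] * log(9 / 16 + theta)
            + (COUNTS["v"] + COUNTS["f"]) * log(3 / 16 - theta)
            + COUNTS["vf"] * log(1 / 16 + theta))

def score(theta):
    """Derivative of the log-likelihood with respect to theta."""
    return (COUNTS["o"] / (9 / 16 + theta)
            - (COUNTS["v"] + COUNTS["f"]) / (3 / 16 - theta)
            + COUNTS["vf"] / (1 / 16 + theta))

def bisect_root(f, lo, hi, tol=1e-10):
    """Root of a decreasing function f, assuming f(lo) > 0 > f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

theta_hat = bisect_root(score, 0.0, 0.18)
print(round(theta_hat, 4))
```

Bisection is used here only because it needs no extra libraries; any one-dimensional root finder or optimiser would serve.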

Solution to Activity 9 


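(a) For the random sample $x_1 = 25$, $x_2 = 31$ from the exponential
distribution, the likelihood of $\theta$ is

$$\begin{aligned}
L(\theta) &= f(x_1; \theta) \times f(x_2; \theta) = \theta e^{-25\theta} \times \theta e^{-31\theta} \\
&= \theta^2 e^{-\theta(25+31)} = \theta^2 e^{-56\theta}.
\end{aligned}$$

(b) Figure 6 shows that the MLE of $\theta$ is around perhaps 0.04. (Exactly,
the maximiser turns out to be $\hat{\theta} = \tfrac{1}{28} \approx 0.0357$.)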
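
Solution to Activity 10

(a) The likelihood $L(\theta)$ is obtained as the product of values of the p.d.f.:
$$\begin{aligned}
L(\theta) &= f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_6; \theta) \\
&= f(22.2; \theta) \times f(2.8; \theta) \times \cdots \times f(8.3; \theta) \\
&= \frac{22.2}{\theta^2} e^{-22.2^2/2\theta^2} \times \frac{2.8}{\theta^2} e^{-2.8^2/2\theta^2} \times \frac{4.0}{\theta^2} e^{-4.0^2/2\theta^2} \\
&\quad \times \frac{13.9}{\theta^2} e^{-13.9^2/2\theta^2} \times \frac{11.7}{\theta^2} e^{-11.7^2/2\theta^2} \times \frac{8.3}{\theta^2} e^{-8.3^2/2\theta^2} \\
&= \frac{335\,621}{\theta^{12}}\, e^{-457.8/\theta^2},
\end{aligned}$$
as required.

(b) $$L(8.5) = \frac{335\,621}{(8.5)^{12}}\, e^{-457.8/(8.5)^2} \approx 4.178 \times 10^{-9},$$
$$L(9) = \frac{335\,621}{9^{12}}\, e^{-457.8/9^2} \approx 4.172 \times 10^{-9}.$$
These calculations complete the following table.

Table 14

$\theta$                    8.25    8.50    8.75    9.00    9.25
$L(\theta) \times 10^9$     4.048   4.178   4.216   4.172   4.059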

(c) A graph of $L(\theta) \times 10^9$ is shown in Figure 9.

Figure 9 A graph of $L(\theta) \times 10^9$ for $8.25 \le \theta \le 9.25$

From the position of the peak of the curve, $\hat{\theta} \approx 8.75$. (A more
accurate value is $\hat{\theta} = 8.735$, correct to three decimal places.)
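
The entries of Table 14 can be recomputed from the six observations. This is a sketch only: it evaluates the likelihood from the unrounded data, so results may differ from the table in the last digit, and it takes the closed-form maximiser $\sqrt{\sum x_i^2/(2n)}$ from the Rayleigh result derived in the exercises of this unit.

```python
from math import exp, prod, sqrt

DATA = [22.2, 2.8, 4.0, 13.9, 11.7, 8.3]

def rayleigh_likelihood(theta, data=DATA):
    """Product of Rayleigh p.d.f. values (x / theta^2) * exp(-x^2 / (2 theta^2))."""
    return prod((x / theta**2) * exp(-x**2 / (2 * theta**2)) for x in data)

# Likelihood values scaled by 10^9, on the grid used in Table 14.
table = {t: rayleigh_likelihood(t) * 1e9 for t in (8.25, 8.50, 8.75, 9.00, 9.25)}

# Closed-form maximiser for the Rayleigh model: sqrt(sum of squares / (2n)).
theta_hat = sqrt(sum(x**2 for x in DATA) / (2 * len(DATA)))

for t, value in table.items():
    print(f"{t:.2f}  {value:.3f}")
print(round(theta_hat, 3))
```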


Solution to Activity 11 


In each case, let f(x) denote the function to be differentiated. The 
solutions below, and in other activities in this subsection, give all 
calculation steps in detail, but you can combine steps to shorten 
calculations if you are comfortable doing that. 


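(a) $f'(x) = 2 \times 6x^{2-1} = 12x$.

(b) $f'(x) = 5.1 \times 4x^{5.1-1} = 20.4x^{4.1}$.

(c) Since $f(x) = 5x^{-4}$,

$$f'(x) = -4 \times 5x^{-4-1} = -20x^{-5} = -\frac{20}{x^5}.$$

(d) Since $f(x) = -4x^{3/2}$,

$$f'(x) = \tfrac{3}{2} \times \left(-4x^{3/2-1}\right) = -6x^{1/2} = -6\sqrt{x}.$$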


(e) f'(x) = 27. 
(f) Since $f(x) = -3x^{-1/2}$,

$$f'(x) = -\tfrac{1}{2} \times \left(-3x^{-(1/2)-1}\right) = \tfrac{3}{2}x^{-3/2}.$$

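
Solution to Activity 12

(a) $$\begin{aligned}
f'(x) &= \frac{d}{dx}(4) + \frac{d}{dx}(3x) + \frac{d}{dx}(x^2) + \frac{d}{dx}(-5x^3) + \frac{d}{dx}(2x^7) \\
&= 0 + 3 + 2x^{2-1} + 3 \times (-5x^{3-1}) + 7 \times 2x^{7-1} \\
&= 3 + 2x - 15x^2 + 14x^6.
\end{aligned}$$

(b) Since $f(x) = 4 - 3x^{-1/2} - 2x^{-3}$,

$$f'(x) = \tfrac{3}{2}x^{-3/2} + 6x^{-4}.$$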

Solution to Activity 13

(a) $f'(x) = \tfrac{1}{2} \times 6e^{x/2} = 3e^{x/2}$.

(b) $f'(x) = -3 \times 3e^{-3x} = -9e^{-3x}$.

(c) $f'(x) = -0.1 \times 10e^{-0.1x} = -e^{-0.1x}$.

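
Solution to Activity 14

(a) Here,
$$h(y) = y^k \quad\text{and}\quad y = g(x) = 1 - x.$$
For these functions,
$$h'(y) = ky^{k-1} \quad\text{and}\quad g'(x) = -1.$$
Therefore
$$f'(x) = -1 \times ky^{k-1} = -k(1-x)^{k-1}.$$

(b) Here,
$$h(y) = 12y^4 \quad\text{and}\quad y = g(x) = 1 + 2x + 2x^{-1/2}.$$
For these functions,
$$h'(y) = 48y^3 \quad\text{and}\quad g'(x) = 2 - x^{-3/2}.$$
Therefore
$$\begin{aligned}
f'(x) &= \left(2 - x^{-3/2}\right) \times 48y^3 \\
&= \left(2 - x^{-3/2}\right) \times 48\left(1 + 2x + 2x^{-1/2}\right)^3 \\
&= 96\left(1 - \frac{1}{2x^{3/2}}\right)\left(1 + 2x + 2x^{-1/2}\right)^3.
\end{aligned}$$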


Solution to Activity 15

(a) Here,
$$g(x) = x^2 \quad\text{and}\quad h(x) = (1-x)^3.$$
For these functions, $g'(x) = 2x$ and, by the chain rule,
$$h'(x) = -1 \times 3(1-x)^2 = -3(1-x)^2.$$
Therefore
$$\begin{aligned}
f'(x) &= 2x \times (1-x)^3 + x^2 \times \{-3(1-x)^2\} \\
&= x(1-x)^2\{2(1-x) - 3x\} = x(1-x)^2(2 - 5x).
\end{aligned}$$

(b) Here,
$$g(x) = x \quad\text{and}\quad h(x) = e^{-x}.$$
For these functions,
$$g'(x) = 1 \quad\text{and}\quad h'(x) = -e^{-x}.$$
Therefore
$$f'(x) = 1 \times e^{-x} + x \times (-e^{-x}) = e^{-x}(1 - x).$$
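
Derivatives obtained with the product rule can be sanity-checked numerically. The sketch below compares the two closed forms from this activity, $x(1-x)^2(2-5x)$ and $e^{-x}(1-x)$, with a central-difference approximation; the step size and the test points are arbitrary illustrative choices.

```python
from math import exp

def central_difference(f, x, h=1e-6):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f_a(x):
    return x**2 * (1 - x)**3

def df_a(x):
    return x * (1 - x)**2 * (2 - 5 * x)

def f_b(x):
    return x * exp(-x)

def df_b(x):
    return exp(-x) * (1 - x)

points = [0.2, 0.7, 1.5]
max_error = max(
    abs(central_difference(f, x) - df(x))
    for f, df in ((f_a, df_a), (f_b, df_b))
    for x in points
)
print(max_error)
```

A small maximum error supports the algebra, although a numerical check of this kind is not a proof.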
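
Solution to Activity 16

(a) Using Equation (9) to differentiate a product, we have
$$L'(\theta) = 3\theta^2 \times e^{-20\theta} + \theta^3 \times \left(-20e^{-20\theta}\right) = \theta^2 e^{-20\theta}(3 - 20\theta).$$

(b) Since $\theta^2 e^{-20\theta} > 0$ for any value of $\theta > 0$, the only solution of $L'(\theta) = 0$
satisfies
$$3 - 20\theta = 0.$$
Therefore the MLE is $\hat{\theta} = \tfrac{3}{20} = 0.15$, as required.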


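Solution to Activity 17

(a) Since the p.m.f. of the geometric distribution is
$$p(x; \theta) = (1-\theta)^{x-1}\theta,$$
the likelihood is
$$\begin{aligned}
L(\theta) &= p(1; \theta)^6 \times p(2; \theta)^4 \times p(3; \theta)^3 \times p(4; \theta)^3 \times \cdots \times p(26; \theta)^1 \times p(29; \theta)^1 \\
&= \underbrace{\theta \times \cdots \times \theta}_{6\text{ times}} \times \underbrace{(1-\theta)\theta \times \cdots \times (1-\theta)\theta}_{4\text{ times}} \\
&\quad \times \underbrace{(1-\theta)^2\theta \times \cdots \times (1-\theta)^2\theta}_{3\text{ times}} \times \underbrace{(1-\theta)^3\theta \times \cdots \times (1-\theta)^3\theta}_{3\text{ times}} \\
&\quad \times \cdots \times (1-\theta)^{25}\theta \times (1-\theta)^{28}\theta \\
&= (1-\theta)^{4 \times 1 + 3 \times 2 + 3 \times 3 + \cdots + 25 + 28}\,\theta^{28} \\
&= (1-\theta)^{175}\theta^{28},
\end{aligned}$$
as required.

(b) Using Equation (9) to differentiate a product and Equation (8) to
differentiate the first term in the product, we have
$$\begin{aligned}
L'(\theta) &= (-1) \times 175(1-\theta)^{174} \times \theta^{28} + (1-\theta)^{175} \times 28\theta^{27} \\
&= (1-\theta)^{174}\theta^{27}\{-175\theta + 28(1-\theta)\} \\
&= (1-\theta)^{174}\theta^{27}(28 - 203\theta).
\end{aligned}$$

(c) Since $0 < \theta < 1$, we have $(1-\theta)^{174}\theta^{27} > 0$, so the (only) solution of
$L'(\theta) = 0$ is the solution of
$$28 - 203\theta = 0.$$
Hence
$$\hat{\theta} = \frac{28}{203} \approx 0.138.$$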
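
The estimate can be cross-checked numerically from the two summary figures used above, 28 observations with total 203. The sketch maximises the log-likelihood over a grid instead of differentiating; taking logs is not part of the solution above, but it leaves the maximiser unchanged because the logarithm is an increasing function. The grid spacing is an arbitrary illustrative choice.

```python
from fractions import Fraction
from math import log

N_OBS = 28     # number of observations
TOTAL = 203    # sum of the observations

def log_likelihood(theta):
    """log of (1 - theta)^(TOTAL - N_OBS) * theta^N_OBS."""
    return (TOTAL - N_OBS) * log(1 - theta) + N_OBS * log(theta)

theta_hat = Fraction(N_OBS, TOTAL)          # closed form: n / sum of x

grid = [k / 1000 for k in range(1, 1000)]   # 0.001, 0.002, ..., 0.999
best_on_grid = max(grid, key=log_likelihood)

print(theta_hat, round(float(theta_hat), 3), best_on_grid)
```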


Solution to Activity 18 


(a) The likelihood based on a single observation $x$ from a discrete
distribution is $L(\theta) = p(x; \theta)$. Using the formula for a binomial
probability, we therefore have that

$$L(\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}.$$

(b) To simplify the ensuing formulas, write $C = \binom{n}{x}$ and note that $C > 0$;
this is valid because $\binom{n}{x}$ does not depend on $\theta$. This enables us to write
$$L(\theta) = C\theta^x(1-\theta)^{n-x}.$$

Then, differentiating the product (using Equation (9)) and using the
chain rule (using Equation (8)) to differentiate the second term in the
product,

$$\begin{aligned}
L'(\theta) &= C\left\{x\theta^{x-1} \times (1-\theta)^{n-x} + \theta^x \times (-1) \times (n-x)(1-\theta)^{n-x-1}\right\} \\
&= C\theta^{x-1}(1-\theta)^{n-x-1}\{x(1-\theta) - (n-x)\theta\} \\
&= C\theta^{x-1}(1-\theta)^{n-x-1}(x - n\theta).
\end{aligned}$$

Since $0 < \theta < 1$, we have $C\theta^{x-1}(1-\theta)^{n-x-1} > 0$, so the only solution
of $L'(\theta) = 0$ is the solution of

$$x - n\theta = 0.$$
Therefore the maximum likelihood estimate of $\theta$ is
$$\hat{\theta} = \frac{x}{n}.$$

(c) Replacing $x$ by $X$, the maximum likelihood estimator of $\theta$ is
$$\hat{\theta} = \frac{X}{n}.$$
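
An equivalent route, not used in the solution above, is to maximise the log-likelihood $\ell(\theta) = \log L(\theta)$, which has the same maximiser because the logarithm is increasing. The following sketch assumes $0 < x < n$ so that every term is defined for $0 < \theta < 1$; the symbol $\ell$ is introduced here for illustration only.

```latex
\begin{align*}
\ell(\theta) &= \log C + x\log\theta + (n-x)\log(1-\theta), \\
\ell'(\theta) &= \frac{x}{\theta} - \frac{n-x}{1-\theta}
              = \frac{x(1-\theta) - (n-x)\theta}{\theta(1-\theta)}
              = \frac{x - n\theta}{\theta(1-\theta)}.
\end{align*}
```

Setting $\ell'(\theta) = 0$ again gives $x - n\theta = 0$, in agreement with the estimate $x/n$ obtained by differentiating $L(\theta)$ directly.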


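Solution to Activity 19

(a) The likelihood is
$$\begin{aligned}
L(\theta) &= f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_n; \theta) \\
&= \theta e^{-\theta x_1} \times \theta e^{-\theta x_2} \times \cdots \times \theta e^{-\theta x_n} \\
&= \theta^n e^{-\theta(x_1 + x_2 + \cdots + x_n)} \\
&= \theta^n e^{-\theta\sum_i x_i} = \theta^n e^{-\theta n\bar{x}}.
\end{aligned}$$

(b) Using Equation (9) to differentiate a product,
$$\begin{aligned}
L'(\theta) &= n\theta^{n-1} \times e^{-\theta n\bar{x}} + \theta^n \times (-n\bar{x})e^{-\theta n\bar{x}} \\
&= n\theta^{n-1}e^{-\theta n\bar{x}}(1 - \theta\bar{x}).
\end{aligned}$$
Since $\theta$ is positive, so is $n\theta^{n-1}e^{-\theta n\bar{x}}$, and the only solution of
$L'(\theta) = 0$ is when
$$1 - \theta\bar{x} = 0,$$
that is, the maximum likelihood estimate of $\theta$ is
$$\hat{\theta} = \frac{1}{\bar{x}}.$$

(c) The maximum likelihood estimator of $\theta$ for the exponential
distribution is
$$\hat{\theta} = \frac{1}{\overline{X}};$$
it is the reciprocal of the sample mean.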
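
As a brief illustration, the reciprocal-of-the-mean formula can be applied to the two-observation sample 25, 31 from Activity 9 and compared with likelihood values at nearby points. The function names below are illustrative, not part of the unit.

```python
from math import exp

def exponential_likelihood(theta, data):
    """Product of exponential p.d.f. values theta * exp(-theta * x)."""
    result = 1.0
    for x in data:
        result *= theta * exp(-theta * x)
    return result

def exponential_mle(data):
    """Reciprocal of the sample mean, n / sum(x)."""
    return len(data) / sum(data)

sample = [25, 31]                       # data from Activity 9
theta_hat = exponential_mle(sample)     # 2 / 56 = 1 / 28

print(round(theta_hat, 4))
print(exponential_likelihood(theta_hat, sample) > exponential_likelihood(0.04, sample))
```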


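Solution to Activity 20

(a) From Table 11, the maximum likelihood estimate of $p$ is $\hat{p} = 1/\bar{x}$. The
sample mean is
$$\bar{x} = \frac{1}{12}\sum_{i=1}^{12} x_i = \frac{89}{12} \approx 7.417.$$
So the MLE of $p$ is $\hat{p} = \tfrac{12}{89} \approx 0.135$.

(b) From Table 11, the maximum likelihood estimate of $\lambda$ is $\hat{\lambda} = \bar{x}$. From
Example 1, $\bar{x} \approx 0.816$. So the MLE of $\lambda$ is $\hat{\lambda} \approx 0.816$.

(c) From Table 11, the maximum likelihood estimate of $\lambda$ is $\hat{\lambda} = 1/\bar{x}$. The
sample mean is
$$\bar{x} = \frac{0.131 + 2.58 + \cdots + 0.19}{8} = \frac{13.961}{8} \approx 1.745.$$
So the MLE of $\lambda$ is $\hat{\lambda} = 1/\bar{x} \approx 0.573$.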



Solution to Activity 21

(a) (i) From Table 11, the maximum likelihood estimate of $\mu$ is
$\hat{\mu} = \bar{x} \approx 6.7444\,\%$.

(ii) From Table 11, the maximum likelihood estimate of $\sigma^2$ is
$$\hat{\sigma}^2 = W = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Note that
$$W = \frac{n-1}{n} \times \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{n-1}{n}\,s^2,$$
and the MLE of $\sigma^2$ is
$$\hat{\sigma}^2 = W \approx \frac{8}{9} \times 0.2953 \approx 0.2625\,\%^2.$$

(b) (i) The maximum likelihood estimate of $\mu$ is $\hat{\mu} = \bar{x} \approx 39.8489$ inches.

(ii) As in part (a)(ii), the maximum likelihood estimate of $\sigma^2$ is
$$\hat{\sigma}^2 = W = \frac{n-1}{n}\,s^2,$$
so
$$\hat{\sigma}^2 \approx \frac{5731}{5732} \times 4.2989 \approx 4.2982 \text{ inches}^2.$$

(c) The values of $s^2$ and $\hat{\sigma}^2$ are fairly similar for the coinage data and very
similar for the chest measurement data: for the coins, $s^2$ and $\hat{\sigma}^2$ differ
by 0.0328; for the chest measurements, $s^2$ and $\hat{\sigma}^2$ differ by only 0.0007.
(The means of both datasets are of broadly comparable size, so it is
meaningful to compare the sizes of these differences across datasets.)
The degree of similarity between $s^2$ and $\hat{\sigma}^2$ is driven by the sample
size. For the coin data, $n$ is relatively small and the factor $(n-1)/n$
by which $s^2$ is multiplied to obtain $\hat{\sigma}^2$ is noticeably different from 1 (it
is $8/9 \approx 0.8889$); for the chest measurement data, $n$ is large and
the factor by which $s^2$ is multiplied to obtain $\hat{\sigma}^2$ is very close to 1 (it
is $5731/5732 \approx 0.9998$).


Solutions to exercises 


Solution to Exercise 1

(a) $\overline{X}_1 - \overline{X}_2$ is an unbiased estimator of $\mu_1 - \mu_2$ because

$$E(\overline{X}_1 - \overline{X}_2) = E(\overline{X}_1) - E(\overline{X}_2) = \mu_1 - \mu_2,$$

the individual sample means being unbiased estimators of the
corresponding population means. (See Subsection 3.3 of Unit 6.)

(b) The variance of $\overline{X}_1 - \overline{X}_2$ is
$$V(\overline{X}_1 - \overline{X}_2) = V(\overline{X}_1) + V(\overline{X}_2) = \frac{9.5^2}{11} + \frac{8.2^2}{10} \approx 14.93.$$
(See Subsection 3.3 of Unit 6, noting that $\overline{X}_1$ and $\overline{X}_2$ are independent.)

(c) If the additional pig were put on the high-protein diet (making twelve
pigs on that diet), then the variance of the estimator would be

$$V(\overline{X}_1 - \overline{X}_2) = \frac{9.5^2}{12} + \frac{8.2^2}{10} \approx 14.24.$$

Alternatively, if the additional pig were put on the low-protein diet
(making eleven pigs on that diet), then the variance of the estimator
would be

$$V(\overline{X}_1 - \overline{X}_2) = \frac{9.5^2}{11} + \frac{8.2^2}{11} \approx 14.32.$$

It is good for an estimator to have a small variance: the smaller, the
better. So, from this point of view, the extra pig should be put on the
high-protein diet.
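
The three allocations can be compared with a few lines of code. This sketch assumes, as in the working above, standard deviations of 9.5 (high-protein group, currently 11 pigs) and 8.2 (low-protein group, currently 10 pigs); the function and variable names are illustrative.

```python
def variance_of_difference(sd1, n1, sd2, n2):
    """Variance of the difference of two independent sample means."""
    return sd1**2 / n1 + sd2**2 / n2

SD_HIGH, SD_LOW = 9.5, 8.2   # standard deviations used in the solution

current = variance_of_difference(SD_HIGH, 11, SD_LOW, 10)
extra_high = variance_of_difference(SD_HIGH, 12, SD_LOW, 10)   # extra pig on high-protein
extra_low = variance_of_difference(SD_HIGH, 11, SD_LOW, 11)    # extra pig on low-protein

print(round(current, 2), round(extra_high, 2), round(extra_low, 2))
```

Adding the pig to the group with the larger standard deviation gives the larger reduction here, which matches the recommendation in part (c).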

Solution to Exercise 2 


(a) The probability mass function for the geometric distribution with
parameter $\theta$ is

$$p(x; \theta) = (1-\theta)^{x-1}\theta, \quad x = 1, 2, \ldots.$$
So the likelihood for this particular random sample of size 4 is given by
$$\begin{aligned}
L(\theta) &= p(3; \theta) \times p(1; \theta) \times p(2; \theta) \times p(2; \theta) \\
&= (1-\theta)^{3-1}\theta \times (1-\theta)^{1-1}\theta \times (1-\theta)^{2-1}\theta \times (1-\theta)^{2-1}\theta \\
&= (1-\theta)^{2+0+1+1}\,\theta^4 \\
&= (1-\theta)^4\theta^4.
\end{aligned}$$

(b) $$L(0.4) = (1-0.4)^4(0.4)^4 = (0.6)^4(0.4)^4 \approx 0.0033,$$
$$L(0.5) = (1-0.5)^4(0.5)^4 = (0.5)^8 \approx 0.0039.$$
The completed table for $L(\theta)$ is given below.

Table 15

$\theta$       0.3      0.4      0.5      0.6      0.7
$L(\theta)$    0.0019   0.0033   0.0039   0.0033   0.0019

(c) A graph of $L(\theta)$ is shown in Figure 10.

Figure 10 A graph of $L(\theta)$ for $0.3 \le \theta \le 0.7$

(d) From the position of the peak of the curve, $\hat{\theta} \approx 0.5$. (In fact, 0.5 is the
exact value of $\hat{\theta}$.)

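Solution to Exercise 3

(a) Since $f(x) = 3 + 4x^{-1} - 11x^{-5}$,
$$f'(x) = 0 - 4x^{-2} + 55x^{-6} = -\frac{4}{x^2} + \frac{55}{x^6}.$$

(b) Since $f(x) = (1 + x^3)^{1/2}$, this is of the form of Equation (8) if we set
$$h(y) = y^{1/2} \quad\text{and}\quad y = g(x) = 1 + x^3.$$
Now,
$$h'(y) = \tfrac{1}{2}y^{-1/2} \quad\text{and}\quad g'(x) = 3x^2.$$
It follows that
$$f'(x) = 3x^2 \times \tfrac{1}{2}y^{-1/2} = 3x^2 \times \frac{1}{2(1+x^3)^{1/2}} = \frac{3x^2}{2\sqrt{1+x^3}}.$$

(c) This function can be written as $g(x)\,h(x)$ where
$$g(x) = x^{-1/2} \quad\text{and}\quad h(x) = (1+x)^{10}.$$
Now,
$$g'(x) = -\frac{1}{2x^{3/2}}$$
and, by the chain rule (using Equation (8)),
$$h'(x) = 1 \times 10(1+x)^9 = 10(1+x)^9.$$
It follows from Equation (9) that
$$f'(x) = -\tfrac{1}{2}x^{-3/2}(1+x)^{10} + x^{-1/2} \times 10(1+x)^9.$$
Noting that
$$x^{-1/2} = x \times x^{-3/2},$$
$f'(x)$ can be simplified to
$$f'(x) = \frac{(1+x)^9}{x^{3/2}}\left\{-\tfrac{1}{2}(1+x) + 10x\right\} = \frac{(1+x)^9}{x^{3/2}}\left(\tfrac{19}{2}x - \tfrac{1}{2}\right).$$

(d) This function can be written as $g(x)\,h(x)$ where
$$g(x) = x^2 \quad\text{and}\quad h(x) = e^{-x}.$$
Now,
$$g'(x) = 2x \quad\text{and}\quad h'(x) = -e^{-x}.$$
It follows from Equation (9) that
$$f'(x) = 2x \times e^{-x} + x^2 \times (-e^{-x}) = xe^{-x}(2 - x).$$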


Solution to Exercise 4 


(a) Using Equation (9) to differentiate a product and Equation (8) to
differentiate the second term in the product, we have

$$\begin{aligned}
L'(\theta) &= 4\theta^3 \times (1-\theta)^4 + \theta^4 \times (-1) \times 4(1-\theta)^3 \\
&= 4\theta^3(1-\theta)^3\{(1-\theta) - \theta\} = 4\theta^3(1-\theta)^3(1 - 2\theta).
\end{aligned}$$

(b) Since $0 < \theta < 1$, we have $4\theta^3(1-\theta)^3 > 0$, so the (only) solution of
$L'(\theta) = 0$ is the solution of

$$1 - 2\theta = 0.$$
Hence
$$\hat{\theta} = \frac{1}{2} = 0.5.$$


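Solution to Exercise 5

(a) Since the p.d.f. of the exponential distribution is
$$f(x; \theta) = \theta e^{-\theta x},$$
the likelihood is
$$\begin{aligned}
L(\theta) &= \theta e^{-8.5\theta} \times \theta e^{-6.5\theta} \times \cdots \times \theta e^{-1215\theta} \\
&= \theta^{22} e^{-\theta(8.5 + 6.5 + \cdots + 1215)} = \theta^{22} e^{-4350.75\,\theta},
\end{aligned}$$
as required.

(b) Using Equation (9) to differentiate a product,
$$\begin{aligned}
L'(\theta) &= 22\theta^{21} \times e^{-4350.75\,\theta} + \theta^{22} \times (-4350.75)e^{-4350.75\,\theta} \\
&= \theta^{21} e^{-4350.75\,\theta}(22 - 4350.75\,\theta).
\end{aligned}$$

(c) Since $\theta > 0$, we have $\theta^{21} e^{-4350.75\,\theta} > 0$, so the (only) solution of
$L'(\theta) = 0$ is the solution of

$$22 - 4350.75\,\theta = 0.$$
Hence
$$\hat{\theta} = \frac{22}{4350.75} \approx 0.0051.$$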





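Solution to Exercise 6

(a) The likelihood is

$$\begin{aligned}
L(\theta) &= f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_n; \theta) \\
&= \frac{x_1}{\theta^2} e^{-x_1^2/2\theta^2} \times \frac{x_2}{\theta^2} e^{-x_2^2/2\theta^2} \times \cdots \times \frac{x_n}{\theta^2} e^{-x_n^2/2\theta^2} \\
&= \frac{x_1 x_2 \cdots x_n}{\theta^{2n}}\, e^{-\sum_i x_i^2/2\theta^2}.
\end{aligned}$$

Using the notation $C$ and $m_2$ defined in the question, and observing
that $\sum_i x_i^2 = nm_2$, we have

$$L(\theta) = C\theta^{-2n} e^{-nm_2\theta^{-2}/2},$$
as required.

(b) The final term in the product that makes up $L(\theta)$ can be
differentiated using the chain rule (Equation (8)), giving

$$\frac{d}{d\theta}\, e^{-nm_2\theta^{-2}/2} = nm_2\theta^{-3} e^{-nm_2\theta^{-2}/2}.$$
So, using Equation (9) to differentiate the product, we have
$$\begin{aligned}
L'(\theta) &= C\left\{\frac{d}{d\theta}\left(\theta^{-2n}\right) \times e^{-nm_2\theta^{-2}/2} + \theta^{-2n} \times \frac{d}{d\theta}\, e^{-nm_2\theta^{-2}/2}\right\} \\
&= C\left\{-2n\theta^{-2n-1} \times e^{-nm_2\theta^{-2}/2} + \theta^{-2n} \times nm_2\theta^{-3} e^{-nm_2\theta^{-2}/2}\right\} \\
&= Cn\theta^{-2n-3} e^{-nm_2\theta^{-2}/2}\left(-2\theta^2 + m_2\right).
\end{aligned}$$
Since, for $\theta > 0$, $Cn\theta^{-2n-3} e^{-nm_2\theta^{-2}/2} > 0$, $L'(\theta)$ is equal to zero only
when
$$-2\theta^2 + m_2 = 0,$$
that is, when $\theta^2 = m_2/2$ or

$$\hat{\theta} = \sqrt{\frac{m_2}{2}}.$$

(Note that we take the positive square root because $\theta > 0$.)

(c) The maximum likelihood estimator of $\theta$ for the Rayleigh distribution
is therefore

$$\hat{\theta} = \sqrt{\frac{M_2}{2}}.$$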

Solution to Exercise 7

(a) The maximum likelihood estimate of the parameter $p$ for a geometric
distribution is $\hat{p} = 1/\bar{x}$. For the given data, the sample mean is

$$\bar{x} = \frac{(71 \times 1) + (28 \times 2) + \cdots + (1 \times 6)}{71 + 28 + 5 + 2 + 2 + 1} = \frac{166}{109} \approx 1.523.$$
Hence the maximum likelihood estimate of $p$ is
$$\hat{p} = \frac{109}{166} \approx 0.657.$$

(b) If $Y \sim B(166, p)$, then the maximum likelihood estimate of $p$ is
$$\hat{p} = \frac{y}{n} = \frac{109}{166} \approx 0.657.$$

(c) According to Table 11, the estimator in part (a) is biased, while the
estimator in part (b) is unbiased. This may seem odd, as the
estimator's value is the same in each case. The explanation is that the
modelling assumptions differ in the two cases. In part (a) the
sampling model is a geometric distribution, while in part (b) it is a
binomial distribution.
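
The agreement between the two estimates can be confirmed directly from the frequencies. The run-length counts below are those read from the working in part (a) (71, 28, 5, 2, 2 and 1 runs of lengths 1 to 6); treat them as an assumption, checked here only by the fact that they reproduce the totals 109 and 166.

```python
from fractions import Fraction

# Assumed frequency table: run length -> number of runs of healthy trees.
RUN_COUNTS = {1: 71, 2: 28, 3: 5, 4: 2, 5: 2, 6: 1}

n_runs = sum(RUN_COUNTS.values())                                   # number of runs
n_healthy = sum(length * count for length, count in RUN_COUNTS.items())  # sum of run lengths

# Part (a), as in the solution: geometric MLE 1 / xbar with xbar = 166 / 109.
geometric_mle = 1 / Fraction(166, n_runs)
# Part (b): binomial MLE y / n with y = 109 healthy trees out of n = 166.
binomial_mle = Fraction(109, 166)

print(n_runs, n_healthy, geometric_mle, float(binomial_mle))
```

Both estimates reduce to the same fraction even though they arise from different sampling models, which is the point made in part (c).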

